Microsoft AI Unveils Three New Foundational Models for Text, Voice, and Image Generation

Microsoft AI, the company's research division, has released three new foundational artificial intelligence models that generate text, voice, and images. The release expands Microsoft's multimodal AI lineup and positions it against rivals such as Google and OpenAI, even as the company reaffirms its partnership with OpenAI.

Introducing Microsoft AI's New Generative Models

The three new models extend Microsoft's in-house generative AI lineup:

  • MAI-Transcribe-1 for high-speed speech-to-text conversion.
  • MAI-Voice-1 for advanced audio generation.
  • MAI-Image-2 for image generation.

Model Specifics and Capabilities

MAI-Transcribe-1

This model converts speech to text and supports 25 languages. Microsoft says it runs 2.5 times faster than the company's existing Azure Fast offering.

MAI-Voice-1

MAI-Voice-1 is an audio-generation model that can produce 60 seconds of audio in one second. It can also generate custom voices.
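To put the quoted throughput in perspective, here is a minimal sketch that converts a clip length into an estimated generation time. It assumes the 60-to-1 ratio scales linearly and ignores network latency, queuing, and startup overhead, none of which the article addresses; the function and constant names are hypothetical.

```python
# Throughput quoted in the article: 60 seconds of audio per second of compute.
AUDIO_SECONDS_PER_WALL_SECOND = 60


def estimated_generation_seconds(audio_seconds: float) -> float:
    """Estimate wall-clock generation time, assuming linear scaling."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / AUDIO_SECONDS_PER_WALL_SECOND
```

Under that assumption, `estimated_generation_seconds(3600)` gives 60.0, meaning an hour of audio would take roughly one minute to generate.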

MAI-Image-2

Developed for visual content, MAI-Image-2 is Microsoft's own foundational model for generating images from text prompts.

Availability and Development

MAI-Image-2 made its debut on MAI Playground, Microsoft's new large language model testing platform, on March 19. Currently, all three models are accessible through Microsoft Foundry, with MAI-Transcribe-1 and MAI-Voice-1 also available on MAI Playground.

The models were developed by the MAI Superintelligence team at Microsoft AI. The research group was established in November 2025 and is led by Mustafa Suleyman, the CEO of Microsoft AI.

Strategic Vision and Competitive Pricing

Mustafa Suleyman has articulated Microsoft AI's overarching objective: to build "Humanist AI." This vision emphasizes human-centered design, optimization for diverse communication methods, and training for practical, real-world applications. The company plans to introduce more models via Foundry and integrate them directly into Microsoft products and experiences.

Within the rapidly evolving large language model market, Microsoft AI intends for these new models to serve as a more cost-effective alternative to offerings from competitors like Google and OpenAI.

Competitive Pricing Structure

Initial pricing details for the new models are as follows:

  • MAI-Transcribe-1: Starting at $0.36 per hour.
  • MAI-Voice-1: Starting at $22 per 1 million characters.
  • MAI-Image-2: Starting at $5 for 1 million text input tokens and $33 for 1 million image output tokens.
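As a rough illustration of how these starting prices translate into a bill, the sketch below multiplies usage by the quoted rates. It assumes strictly linear billing with no minimums, volume tiers, or rounding rules, which the article does not confirm, and since these are "starting at" figures the actual cost may be higher. The function names are hypothetical and do not correspond to any Microsoft API.

```python
from decimal import Decimal

# Starting prices quoted in the article, in US dollars.
TRANSCRIBE_PER_HOUR = Decimal("0.36")
VOICE_PER_MILLION_CHARS = Decimal("22")
IMAGE_PER_MILLION_INPUT_TOKENS = Decimal("5")
IMAGE_PER_MILLION_OUTPUT_TOKENS = Decimal("33")

MILLION = Decimal(1_000_000)


def transcribe_cost(audio_hours) -> Decimal:
    """Cost of transcribing the given number of audio hours."""
    return TRANSCRIBE_PER_HOUR * Decimal(str(audio_hours))


def voice_cost(characters: int) -> Decimal:
    """Cost of synthesizing the given number of text characters."""
    return VOICE_PER_MILLION_CHARS * Decimal(characters) / MILLION


def image_cost(text_input_tokens: int, image_output_tokens: int) -> Decimal:
    """Cost of one or more image requests, billed per input and output token."""
    input_part = IMAGE_PER_MILLION_INPUT_TOKENS * Decimal(text_input_tokens)
    output_part = IMAGE_PER_MILLION_OUTPUT_TOKENS * Decimal(image_output_tokens)
    return (input_part + output_part) / MILLION
```

Under these assumptions, transcribing 10 hours of audio would cost $3.60, and synthesizing 500,000 characters of speech would cost $11.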

Continuing the OpenAI Partnership

Despite developing its own foundational models, Microsoft has reaffirmed its commitment to its partnership with OpenAI. Suleyman said that a recent renegotiation of the partnership enabled Microsoft to pursue its superintelligence research.

Microsoft's significant investment of over $13 billion in OpenAI underscores the depth of this collaboration, with OpenAI's models integrated across Microsoft's product ecosystem. The company also employs a dual strategy for chips, both producing its own and sourcing from external suppliers, ensuring robust infrastructure for its AI ambitions.