Three Patterns for Voice AI: From Voice to Action
OpenAI identified three main scenarios in which developers build voice applications. The first is voice-to-action, where the user says what they need and the system executes it — searches for real estate, schedules a meeting, or orders food. Zillow is already testing an assistant that, given the voice request "find me homes in my budget, avoid busy streets, and arrange a viewing for Saturday," can independently search, filter, and book an appointment.
The second scenario, systems-to-voice, reverses the information flow: software itself actively communicates with the user. A travel app can thus alert you that "your arrival flight is delayed, but you'll make the connection — I've found a new gate and the fastest route through the terminal."
The third pattern, voice-to-voice, enables seamless conversation across languages. Deutsche Telekom is testing the model for customer support, where each person speaks their preferred language and translation happens in real time. Priceline is working on a future where travelers manage their entire vacation by voice — from searching for flights and hotels to handling changes when a flight is delayed.
GPT-Realtime-2: What the New Voice Model Can Do
GPT-Realtime-2 is not just a faster version of the previous model. It is OpenAI's first voice model to leverage reasoning at the GPT-5 level. This means that during a conversation, it can simultaneously think about the question, call external tools, handle corrections and interruptions — and respond appropriately to the situation.
Key new features include:
- Preambles: Developers can enable short phrases like "one moment, let me look into that" or "let me verify this" so the user knows the agent is working on the request.
- Parallel tool calls: The model can call multiple tools simultaneously and announce its actions by voice ("checking the calendar," "searching for that").
- Better error recovery: Instead of silently failing, it says "I'm having trouble with this right now" and continues.
- Longer context window: From 32K to 128K tokens for longer and more complex conversations.
- Stronger domain understanding: It better remembers specialized terminology, proper names, medical terms, and other specific vocabulary.
- Adjustable reasoning effort: Developers choose from five levels — minimal, low, medium, high, and xhigh — depending on whether it's a simple question or a complex task. The default is low for faster response.
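As an illustrative sketch of how these settings might be wired together, the snippet below builds a session-configuration event in the style of OpenAI's existing Realtime API. The field names `reasoning_effort` and `preambles`, and the model identifier, are assumptions for illustration, not confirmed parameter names; consult the API documentation for the actual schema.

```python
import json

# Hypothetical session configuration for a Realtime voice agent.
# "reasoning_effort" and "preambles" are illustrative field names,
# not confirmed API parameters.
def build_session_update(model: str, effort: str = "low") -> dict:
    allowed = {"minimal", "low", "medium", "high", "xhigh"}
    if effort not in allowed:
        raise ValueError(f"effort must be one of {sorted(allowed)}")
    return {
        "type": "session.update",
        "session": {
            "model": model,
            "reasoning_effort": effort,  # default is "low" for faster responses
            "preambles": ["One moment, let me look into that."],
        },
    }

# A simple question can stay at the default; a complex task can opt in to "high".
event = build_session_update("gpt-realtime-2", effort="high")
print(json.dumps(event))
```

In a real application this event would be sent over the Realtime API's WebSocket connection before the conversation starts.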
Benchmarks: How Much Better Is It?
At the high reasoning level, GPT-Realtime-2 scores 15.2% higher on Big Bench Audio (which tests reasoning in models with audio input) than its predecessor, GPT-Realtime-1.5. At the xhigh level, it surpasses its predecessor by 13.8% on Audio MultiChallenge, a test of multi-turn conversational intelligence covering instruction following, context integration, and handling natural speech corrections.
In practice, this is confirmed by Zillow: on their most difficult test scenario, call success rate rose from 69% to 95% after prompt optimization, representing a jump of 26 percentage points. Josh Weisberg, head of AI at Zillow, stated: "The combination of agentic capabilities and the reliability of guardrails is what makes GPT-Realtime-2 viable for production deployment."
GPT-Realtime-Translate: Live Translation for 70+ Languages
The second model targets global communication. GPT-Realtime-Translate supports more than 70 input languages and translates into 13 output languages. Translation happens in real time, so the conversation remains fluid — the model handles regional accents, context changes, and domain-specific language.
Deutsche Telekom is testing the model for multilingual voice interactions in customer support. Vimeo demonstrated how Realtime-Translate can live-translate product videos, so global customers hear the content in their own language without waiting for a separately produced version. Indian startup BolnaAI, which builds voice AI for India's linguistically diverse market, measured a 12.5% lower word error rate across Hindi, Tamil, and Telugu compared to any other model it tested.
For the Czech Republic, the crucial question is Czech language support. Although OpenAI has not published the complete list of supported languages, with more than 70 input languages, it is almost certain that Czech is among them. For the 13 output languages, support will be more selective — likely limited to major world languages. OpenAI will publish the exact list in the API documentation.
GPT-Realtime-Whisper: Transcription That Keeps Pace
The third model is designed for low-latency streaming speech-to-text transcription. GPT-Realtime-Whisper transcribes while the person speaks — captions appear instantly, meeting notes are created during the conversation, and voice agents understand continuously.
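To show what "transcribing while the person speaks" looks like on the client side, here is a minimal sketch that assembles a live caption from streamed transcription events. The event shapes (`transcript.delta` text chunks followed by a `transcript.done` marker) are assumptions modeled on common streaming-API conventions, not the confirmed event names.

```python
# Minimal sketch: building a live caption from streamed transcription
# events. Event types are illustrative assumptions.
def assemble_captions(events):
    """Yield a growing caption string as delta events arrive."""
    caption = ""
    for event in events:
        if event["type"] == "transcript.delta":
            caption += event["text"]
            yield caption          # update the on-screen caption
        elif event["type"] == "transcript.done":
            yield caption.strip()  # final, stable transcript

# Simulated stream, standing in for events read off a WebSocket.
stream = [
    {"type": "transcript.delta", "text": "Hello "},
    {"type": "transcript.delta", "text": "everyone, "},
    {"type": "transcript.delta", "text": "welcome. "},
    {"type": "transcript.done"},
]
print(list(assemble_captions(stream))[-1])  # prints "Hello everyone, welcome."
```

The same incremental pattern serves captioning, live meeting notes, and feeding partial transcripts to a downstream voice agent.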
Use cases range from captioning live broadcasts and education to automatic meeting minutes and faster downstream workflows in customer support, healthcare, sales, or recruitment.
Safety and European Data
OpenAI has deployed several layers of protection. Active classifiers monitor sessions in the Realtime API and can stop a conversation if they detect violations of rules for harmful content. Developers can add their own protective mechanisms via the Agents SDK. Corporate policies prohibit using outputs for spam, deception, or other harmful purposes.
For European companies, a key point is that the Realtime API fully supports EU Data Residency, so data can remain in European data centers. The models are also covered by enterprise privacy commitments.
Pricing and API Availability
All three models are available through OpenAI's Realtime API. Prices are as follows:
- GPT-Realtime-2: $32 per million audio input tokens ($0.40 for cached inputs), $64 per million audio output tokens
- GPT-Realtime-Translate: $0.034 per minute
- GPT-Realtime-Whisper: $0.017 per minute
For context: one hour of real-time translation via GPT-Realtime-Translate costs approximately $2 (about 45 CZK). Compared to a human translator, this is an order of magnitude lower price — while maintaining instant response.
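The arithmetic above can be checked with a quick calculation using the listed prices. The token counts in the voicebot example are assumptions chosen purely for illustration.

```python
# Cost check using the prices listed above (USD).
TRANSLATE_PER_MIN = 0.034          # GPT-Realtime-Translate, per minute
REALTIME2_INPUT = 32 / 1_000_000   # GPT-Realtime-2, per audio input token
REALTIME2_OUTPUT = 64 / 1_000_000  # GPT-Realtime-2, per audio output token

hour_of_translation = TRANSLATE_PER_MIN * 60
print(f"1 h of translation: ${hour_of_translation:.2f}")  # prints $2.04

# Hypothetical voicebot call; token counts are illustrative assumptions.
input_tokens, output_tokens = 500, 400
call_cost = input_tokens * REALTIME2_INPUT + output_tokens * REALTIME2_OUTPUT
print(f"one call: ${call_cost:.3f}")  # prints $0.042
```

Even rough numbers like these make it easy to estimate monthly costs from expected call volumes.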
The models can be tested in the OpenAI Playground and developers can start building via Codex CLI. For Czech developers and companies, the API is fully available immediately under standard terms — no geographic restrictions apply.
What This Means for the Czech Environment
With its models, OpenAI addresses three areas where Czech companies have so far hit limits:
- Voice agents in Czech: thanks to GPT-Realtime-2's better understanding of names and domain terminology, Czech banks, insurance companies, or e-shops can build voicebots that converse naturally, understand context, and can even call internal systems.
- Multilingual support: Realtime-Translate opens doors for companies that serve foreign clients or are expanding abroad.
- Instant meeting transcription: Realtime-Whisper makes work easier for teams operating across time zones or needing accurate meeting minutes.
Czech companies already using the OpenAI API can deploy the new models immediately. The only remaining barrier is price at high volumes — but comparison with alternatives (human staff, traditional translation services) still clearly favors AI.
Does GPT-Realtime-Translate support Czech?
OpenAI states support for 70+ input languages, so Czech is very likely among them. For the 13 output languages, support will be more selective — OpenAI will publish the complete list in the official API documentation. For certainty, we recommend testing in the Playground.
How much does deploying GPT-Realtime-2 for a corporate voicebot cost?
The price depends on token volume. For a typical corporate conversation (approx. 5 minutes, hundreds of tokens of input and output), a single call costs on the order of a few cents. An exact calculation requires knowing the call volume and call lengths; OpenAI recommends starting with a smaller deployment and scaling as needed.
Is GPT-Realtime-2 also available in ChatGPT for regular users?
Not yet. The models are currently available only via the API, for developers. OpenAI has, however, made a demo available on its website where GPT-Realtime-2 can be tried out. The company has not announced whether or when the model will come to ChatGPT itself.