OpenAI Unveils GPT-Realtime-2: Voice Model with GPT-5-Level Intelligence for Live Conversations

OpenAI is releasing three new real-time voice models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. The central model, GPT-Realtime-2, promises reasoning capabilities at the level of GPT-5 directly in a live call, expands the context window to 128,000 tokens, and can work with tools, interruptions, and topic changes in real time. For Czech developers and companies, the key news is that the models are available immediately via the Realtime API, that they support data residency in the EU, and that the translation model covers Czech among its 70+ input languages.

GPT-Realtime-2 thinks in a conversation like GPT-5

ChatGPT has been able to communicate by voice since last year, and Google Gemini offers a similar real-time mode. Until now, however, the models behind these voice interfaces lagged behind their text counterparts, especially the reasoning models that take time to think through a response. According to OpenAI's official announcement, that gap is no longer acceptable: a modern voice agent must understand context, react to changes, call tools, and maintain a natural flow of conversation, all at the same time.

The centerpiece of the new family of models is GPT-Realtime-2. OpenAI claims that its reasoning capabilities reach the level of GPT-5. The model is designed for live voice interactions, where it must conduct dialogue, think through requests, call tools, and simultaneously handle user interruptions.

Technically, this is a significant leap forward. The context window expands from 32,000 to 128,000 tokens, enabling much longer and more complex conversations. The model can call multiple tools in parallel and accompany its actions with audible phrases like "allow me to verify that." Short introductory sentences — for example, "one moment" — let the user know the system is working. When something goes wrong, the model no longer remains awkwardly silent, but apologizes: "I'm having trouble with that right now."
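
The announcement itself ships no code, but a minimal Python connection sketch gives a feel for how such a session might be opened. It assumes the model identifier is simply `gpt-realtime-2` and otherwise sticks to the Realtime API's existing, documented WebSocket conventions (`session.update`, server-side voice activity detection); treat it as a sketch, not a verified integration:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

async def main() -> None:
    # The model identifier is taken from the announcement; the URL shape and
    # headers follow the Realtime API's existing WebSocket convention.
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Use extra_headers= instead on websockets versions older than 14.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # session.update is a documented Realtime API client event; server-side
        # voice activity detection is what lets the model handle interruptions.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a concise, polite voice assistant.",
                "turn_detection": {"type": "server_vad"},
            },
        }))
        async for raw in ws:
            event = json.loads(raw)
            # Audio deltas, tool-call events, and errors all arrive here.
            print(event.get("type"))

asyncio.run(main())
```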

OpenAI also emphasizes improved processing of technical terminology, proper names, and medical terms. The tone of voice is better controllable: calming when solving problems, empathetic with frustrated users, positive after a successful action.

Five levels of thinking from minimal to xhigh

Developers can set the intensity of reasoning at five levels: minimal, low, medium, high, and xhigh. The default is "low" to maintain low latency for simple queries. For more complex tasks, more computing power can be engaged. This granularity is important for commercial deployment, where speed and accuracy must be balanced.
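
In code, selecting an effort level would presumably be a one-line session setting. The five level names below come from the announcement; the `reasoning.effort` field name is an assumption, not a documented parameter:

```python
# The five effort levels named in the announcement.
REASONING_EFFORTS = ("minimal", "low", "medium", "high", "xhigh")

def reasoning_session_update(effort: str = "low") -> dict:
    """Build a session.update event selecting a reasoning level.

    'low' is the stated default, chosen to keep latency down on simple
    queries; the 'reasoning.effort' field name is assumed, not documented.
    """
    if effort not in REASONING_EFFORTS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return {
        "type": "session.update",
        "session": {"reasoning": {"effort": effort}},  # field name assumed
    }
```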

On the Big Bench Audio benchmark, GPT-Realtime-2 achieves 96.6% accuracy in "high" mode, while its predecessor GPT-Realtime-1.5 had 81.4%. In the Audio MultiChallenge test, which measures the ability to follow instructions in multi-turn dialogues, the "xhigh" variant performs even better: 48.5% compared to 34.7% in the previous version.

Three interaction patterns for voice AI

OpenAI defines three basic usage patterns that can also be combined:

  • Voice-to-Action: The user describes aloud what they need. The system thinks through the request, calls the right tools, and completes the task, for example booking a flight or scheduling a meeting (a hypothetical tool-definition sketch follows this list).
  • Systems-to-Voice: Software converts context into spoken guidance. A travel app can inform a passenger that despite the delay, they will make their connection, suggest the fastest route to the new gate, and confirm baggage transfer.
  • Voice-to-Voice: AI helps people conduct live conversations across language barriers. Deutsche Telekom is already testing this pattern for customer support.

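As promised above, here is a hypothetical Voice-to-Action tool definition. The `book_flight` function is invented for illustration, but the flat function-tool shape matches how tools are declared in a Realtime API session today:

```python
# 'book_flight' is an invented example tool; the flat function-tool shape
# below matches the Realtime API's documented session configuration.
BOOK_FLIGHT_TOOL = {
    "type": "function",
    "name": "book_flight",
    "description": "Book a flight once origin, destination and date are known.",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string", "description": "IATA code, e.g. PRG"},
            "destination": {"type": "string", "description": "IATA code"},
            "date": {"type": "string", "format": "date"},
        },
        "required": ["origin", "destination", "date"],
    },
}

# Registering the tool; with 'auto', the model decides when to call it.
TOOLS_SESSION_UPDATE = {
    "type": "session.update",
    "session": {"tools": [BOOK_FLIGHT_TOOL], "tool_choice": "auto"},
}
```
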
These features should soon appear in ChatGPT's audio mode as well. OpenAI believes that "voice can now truly become the primary interface."

Translation and transcription as standalone models

In addition to the flagship GPT-Realtime-2, OpenAI is introducing two specialized models:

GPT-Realtime-Translate is a standalone model for live translation. It supports more than 70 input languages and 13 output languages. It preserves meaning and keeps pace with the speaker even during context changes, regional accents, and specialized vocabulary. Czech is among the supported input languages, which opens up possibilities for Czech companies operating in an international environment — customer support, cross-border sales, education, and media.
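
Assuming the translation model plugs into the same WebSocket flow as the connection sketch above, switching to it might be as small as changing the model query parameter and the instructions. Both the model name and the instruction-based language selection are inferred from the article, not taken from API documentation:

```python
# Model name from the article; how the target language is actually selected
# is not specified, so an instructions string is used as a plausible stand-in.
TRANSLATE_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-translate"

TRANSLATE_SESSION_UPDATE = {
    "type": "session.update",
    "session": {
        "instructions": (
            "Translate all incoming Czech speech into English, preserving "
            "meaning, proper names, and domain vocabulary."
        ),
    },
}
```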

GPT-Realtime-Whisper is a low-latency streaming model for speech transcription. It targets live captions for meetings, classes, broadcasts, and events. Teams can use it for generating notes and summaries during the conversation, for building voice agents with continuous speech understanding, or for accelerating subsequent workflows in customer support, healthcare, sales, and recruitment.
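
For streaming transcription, the client-side mechanics would likely reuse the Realtime API's documented `input_audio_buffer.append` event; only the model name (and whatever transcription events this particular model emits) is an assumption here:

```python
import base64
import json

def audio_chunk_event(pcm16_bytes: bytes) -> str:
    """Wrap a chunk of 16-bit PCM audio in the Realtime API's documented
    input_audio_buffer.append event (audio is sent base64-encoded)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    })

# Against a socket opened with model=gpt-realtime-whisper (name from the
# article), you would stream microphone chunks through audio_chunk_event()
# and render caption text as transcription events arrive; the exact event
# names that model emits are an assumption.
```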

Prices and availability for the Czech market

All three models are available immediately via the Realtime API and can be tested in the OpenAI Playground. Pricing varies by model (a quick cost sketch follows the list):

  • GPT-Realtime-2: $32 per million audio input tokens ($0.40 per million cached input tokens) and $64 per million audio output tokens.
  • GPT-Realtime-Translate: $0.034 per minute.
  • GPT-Realtime-Whisper: $0.017 per minute.
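
A back-of-the-envelope calculation from these list prices, as referenced above. Token-per-minute rates for audio are not stated in the article, so the GPT-Realtime-2 estimate needs measured token counts:

```python
# List prices quoted in the article.
AUDIO_IN_PER_M = 32.0    # USD per million audio input tokens
AUDIO_OUT_PER_M = 64.0   # USD per million audio output tokens
TRANSLATE_PER_MIN = 0.034
WHISPER_PER_MIN = 0.017

def realtime2_call_cost(tokens_in: int, tokens_out: int) -> float:
    """Estimate GPT-Realtime-2 cost for one call from measured token counts."""
    return tokens_in / 1e6 * AUDIO_IN_PER_M + tokens_out / 1e6 * AUDIO_OUT_PER_M

# A call consuming 20,000 input and 10,000 output audio tokens:
# 0.02 * 32 + 0.01 * 64 = $1.28
print(f"${realtime2_call_cost(20_000, 10_000):.2f}")  # -> $1.28

# One hour of live translation vs. transcription:
print(f"${60 * TRANSLATE_PER_MIN:.2f}")  # -> $2.04
print(f"${60 * WHISPER_PER_MIN:.2f}")    # -> $1.02
```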

For Czech and European developers, the key information is that the Realtime API supports data residency in the EU. Data from EU-based applications thus remains on European servers, which is important in the context of GDPR and growing demands for data sovereignty. OpenAI adds that the service is subject to enterprise privacy commitments.

For ordinary users in the Czech Republic, the new models are not yet available directly in the free version of ChatGPT. They should reach ChatGPT's audio mode in the coming weeks, with priority given to subscribers of higher-tier plans. However, developers and companies can start experimenting immediately via the API.

Comparison with the competition

Google Gemini offers a similar real-time conversational mode but has not yet published comparable benchmarks pitting its voice models against its text models. Anthropic focuses more on text models with long context and does not yet have a comparable real-time voice platform. Meta, with its Llama family of models, pushes open-source releases but does not offer a production-grade real-time voice API.

OpenAI thus currently holds the lead in the field of integrated voice intelligence with advanced reasoning. The key question for the coming months will be how quickly this technology reaches end-user products — and whether Czech companies find a way to utilize it.

Is GPT-Realtime-2 available for free in ChatGPT?

No, not yet. GPT-Realtime-2 is available only through the paid Realtime API for developers. It should reach ChatGPT's audio mode in the coming weeks, likely first for subscribers of higher-tier plans. The free version of ChatGPT still uses older voice models.

Can a Czech company use GPT-Realtime-Translate for customer support?

Yes. Czech is among the 70+ supported input languages of the GPT-Realtime-Translate model. A company can use the Realtime API with support for data residency in the EU, which means call recordings remain on European servers. However, it is necessary to ensure full GDPR compliance, including informed consent from customers for processing voice data by artificial intelligence.

How does GPT-Realtime-Whisper differ from the original Whisper model?

The original Whisper is primarily a transcription model for recordings — it works with files. GPT-Realtime-Whisper is designed for low-latency streaming in real time. It transcribes speech as it is spoken, enabling live captions, immediate meeting notes, and continuous understanding for voice agents. The price of $0.017 per minute makes it affordable for continuous operation.
