One API endpoint instead of three vendors
Building a voice AI agent previously meant stitching together three separate services: speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS). Developers had to deal with latency between the individual steps, reconcile billing across vendors, and implement interruption handling and turn-taking logic themselves. AssemblyAI now packages all of this into a single API.
Voice Agent API works via a single WebSocket: you stream audio in, you get audio out. No SDK to install, no proprietary event format — just JSON messages that a developer understands after ten minutes of reading documentation. According to AssemblyAI, most developers deploy a functioning agent on the same day they start.
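To make the "audio in, audio out over one WebSocket" flow concrete, here is a minimal client sketch in Python. The endpoint URL, message types, and field names are illustrative assumptions rather than AssemblyAI's documented schema; the point is only that a plain WebSocket plus JSON messages is all that is involved.

```python
# Minimal sketch of a Voice Agent API client loop.
# The endpoint URL, auth header, and JSON message fields below are
# illustrative assumptions, not the documented schema -- check the
# AssemblyAI docs for the real names.
import base64
import json

from websocket import create_connection  # pip install websocket-client

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"
URL = "wss://example.invalid/v1/voice-agent"  # placeholder endpoint

ws = create_connection(URL, header=[f"Authorization: {API_KEY}"])

# Hypothetical session setup message: system prompt and audio format.
ws.send(json.dumps({
    "type": "session.configure",
    "system_prompt": "You are a polite support agent.",
    "input_audio_format": "pcm_s16le_16khz",
}))

# Stream caller audio in small chunks.
with open("caller_audio.raw", "rb") as f:
    while chunk := f.read(3200):  # ~100 ms of 16 kHz, 16-bit mono audio
        ws.send(json.dumps({
            "type": "input_audio",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))

# Read events until the agent finishes its spoken reply.
while True:
    event = json.loads(ws.recv())
    if event.get("type") == "output_audio":
        pass  # decode event["audio"] and play it back
    elif event.get("type") == "turn.end":
        break

ws.close()
```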
The platform is built on the Universal-3 Pro Streaming model, which AssemblyAI calls the most accurate streaming speech-to-text model on the market. In internal testing on alphanumeric sequences, that is, combinations of letters and numbers such as account numbers, drug codes, or email addresses, it achieved an error rate of only 16.7%, while the OpenAI GPT-4o Realtime API scored 23.3% and Deepgram Nova-3 25.5%. This is not a marginal difference: when a voice agent mishears a sixteen-digit order number, the conversation ends in customer frustration regardless of the quality of the language model behind it.
Listening is harder than speaking
The biggest innovation of Voice Agent API does not lie in the speed of response generation, but in how well the system listens. In an April 2026 AssemblyAI survey, 76% of respondents ranked speech-to-text accuracy as the most important factor when building voice agents — even above latency, price, and integration simplicity.
The reason is simple: if the transcription engine poorly captures a patient's name, a drug name, or an invoice number, the LLM responds to incorrect input. The error cascades and multiplies throughout the entire chain. As the AssemblyAI team puts it: "Garbage in, garbage out."
Therefore, the platform contains several elements that address real conversational situations:
- Intelligent end-of-turn detection: Server-side turn detection distinguishes whether the user has merely paused or has actually finished their thought. It can be tuned to the type of conversation, from a fast IVR flow to a long clinical interview.
- Interruption without delay: When the user interrupts the agent mid-sentence, the system immediately stops speaking and starts listening again. No talking over each other, no awkward silence.
- Tool calling without silence: When the agent calls an external function (for example, order verification in a database), the conversation does not fall into dead silence. The system maintains the flow of dialogue even during backend operations.
- Session recovery: If the WebSocket drops, you can reconnect within 30 seconds and continue exactly where the conversation left off — context is preserved.
- Live configuration: The system prompt, available tools, or detection settings can be changed mid-call without restarting the session, as sketched below.
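For illustration, a mid-call configuration update might look like the following Python sketch. It assumes the open WebSocket connection `ws` from the earlier example, and the event name and field names are assumptions, not the documented schema.

```python
import json

def update_session(ws, prompt: str) -> None:
    """Send a hypothetical mid-call configuration update over an open socket."""
    ws.send(json.dumps({
        "type": "session.update",          # hypothetical event name
        "system_prompt": prompt,
        "turn_detection": {
            "mode": "adaptive",            # fast IVR vs. long clinical interview
            "end_of_turn_silence_ms": 400, # shorter pause => snappier turn-taking
        },
        "tools": ["verify_order"],         # tools available from this point on
    }))
```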
A price that doesn't surprise
AssemblyAI chose a flat-rate model for Voice Agent API: 4.50 USD per hour of conversation (0.075 USD per minute). This means one bill for everything — speech recognition, language model inference, and voice synthesis. No token fees, no concurrency limits, no surprises when scaling.
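A quick back-of-the-envelope calculation shows how the flat rate behaves at scale. The call volumes below are made-up illustrative numbers; only the 4.50 USD hourly rate comes from AssemblyAI's pricing.

```python
# Back-of-the-envelope cost at the flat rate quoted above (4.50 USD/hour).
RATE_PER_HOUR = 4.50
RATE_PER_MINUTE = RATE_PER_HOUR / 60          # 0.075 USD

calls_per_day = 500                           # illustrative volume
avg_call_minutes = 4
monthly_minutes = calls_per_day * avg_call_minutes * 30   # 60,000 minutes

monthly_cost = monthly_minutes * RATE_PER_MINUTE
print(f"{monthly_minutes} min/month -> {monthly_cost:,.0f} USD")  # 60000 min/month -> 4,500 USD
```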
For comparison: building your own STT + LLM + TTS chain can be cheaper at very low volumes, but engineering, billing, and latency costs quickly outweigh the savings. With competing orchestrators like Vapi or Pipecat, the price consists of several variables — per-minute STT rate, token fees for LLM, and per-minute TTS rate.
AssemblyAI also offers 50 USD in free credits to start, with no credit card required. At the flat rate, that works out to roughly eleven hours of full conversation time, enough for startups and developers to test a concept. For large volumes, an enterprise plan with individual rates is available.
Who is Voice Agent API for
The platform is not a template for a ready-made chatbot, but infrastructure on which a team builds its own product. AssemblyAI lists several specific scenarios:
- Contact centers: Automation of ticket routing based on call content, not just keywords.
- Healthcare: Clinical intake that correctly captures drug names and allergies on the first try.
- Sales training: Tools that identify the moment when a salesperson mishandled a customer objection.
- Language education: Applications providing instant feedback in multiple languages.
The advantage is that the API uses standard JSON schemas for tool calling — developers integrate their own business logic directly into the conversation flow without needing to learn a proprietary format.
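As an illustration, a tool definition in the common JSON Schema function-calling style might look like this. The wrapper field names ("name", "description", "parameters") follow the widespread convention; how exactly Voice Agent API expects them to be nested is an assumption to verify against the documentation.

```python
# A tool definition in the common JSON Schema function-calling style.
# The exact wrapping expected by Voice Agent API is an assumption, not
# the documented format.
verify_order_tool = {
    "name": "verify_order",
    "description": "Look up an order by its number and return its status.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_number": {
                "type": "string",
                "description": "Sixteen-digit order number read out by the caller.",
            },
        },
        "required": ["order_number"],
    },
}
```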
Availability in Czech and for the Czech market
Here it is necessary to be honest: AssemblyAI Voice Agent API runs on the Universal-3 Pro Streaming model, which currently explicitly supports English, Spanish, German, French, Italian, and Portuguese. Czech is not yet among them. This means that for Czech companies and developers, Voice Agent API is not ready for production deployment in the native language.
On the other hand, AssemblyAI offers the Universal-2 model with support for 99 languages including Czech, even in real-time mode. For Czech developers who need a Czech voice agent, it therefore remains more reasonable for now to assemble their own pipeline using AssemblyAI STT for Czech and an external LLM and TTS. AssemblyAI itself states that Voice Agent API is primarily for teams that want "the whole pipeline in one integration" — and that is not yet offered in Czech.
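For teams going the DIY route, the shape of such a pipeline is straightforward. The sketch below only shows the structure of the loop; all three helpers are stubs to be replaced with real calls to AssemblyAI streaming STT, an LLM of your choice, and a Czech-capable TTS, and none of them are actual SDK functions.

```python
# Structural sketch of the DIY pipeline described above: AssemblyAI streaming
# STT for Czech plus an external LLM and TTS. The three helpers are stubs you
# would replace with real provider calls; they are not actual SDK functions.

def transcribe_czech(audio_stream):
    """Stub: yield final Czech transcripts from AssemblyAI streaming STT."""
    yield from audio_stream  # pretend each item is already a final transcript

def generate_reply(history):
    """Stub: call your LLM of choice with the conversation history."""
    return f"Rozumím: {history[-1]['content']}"

def synthesize_and_play(text):
    """Stub: send text to a Czech-capable TTS engine and play the audio."""
    print(f"[TTS] {text}")

def run_czech_voice_agent(audio_stream):
    history = []
    for utterance in transcribe_czech(audio_stream):
        history.append({"role": "user", "content": utterance})
        reply = generate_reply(history)
        history.append({"role": "assistant", "content": reply})
        synthesize_and_play(reply)

run_czech_voice_agent(["Dobrý den, chci ověřit objednávku."])
```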
From the perspective of GDPR and EU regulation, AssemblyAI holds SOC 2 Type II certification, supports EU Data Residency, and offers HIPAA BAA for healthcare purposes. For Czech companies operating in regulated industries (healthcare, finance), this is an important advantage over solutions without European data guarantees.
Comparison with competition
The voice AI agent market is filling up quickly. Besides AssemblyAI Voice Agent API, the following are worth mentioning:
- Vapi: Specialized orchestrator for voice agents with support for multiple STT, LLM, and TTS providers. More flexible, but requires more complex setup and billing is variable.
- OpenAI Realtime API: Integrated solution from OpenAI with low latency, but higher error rate on alphanumeric characters and higher price at larger volumes.
- Deepgram Voice API: Strong in real-time transcription with its own models, but a complete voice agent pipeline requires external orchestration.
- LiveKit / Pipecat: Open-source orchestrators that allow combining various STT, LLM, and TTS services — ideal for teams that want full control over every layer.
AssemblyAI positions itself as the most accurate option for production environments, especially where every word, number, or name matters.
Verdict: Infrastructure meant to disappear into the background
AssemblyAI Voice Agent API is not for developers who want to play with individual layers of AI speech. It is for teams that want to build a product — and don't want to spend months fine-tuning turn detection or solving edge cases during interrupted calls.
The flat price of 4.50 USD per hour, accuracy on real-world data, and integration simplicity make Voice Agent API an interesting choice for startups and enterprise teams in English-speaking markets. For Czech developers, the missing Czech support remains a significant limitation to weigh before deciding on deployment. But if you are planning an international product or English-language customer support, AssemblyAI deserves attention.
Do I need a special SDK or framework for Voice Agent API?
No. Voice Agent API uses plain WebSocket and standard JSON messages. You don't need to install any SDK or learn a proprietary event format. According to AssemblyAI, most developers set up a functioning agent within one afternoon. The API is even designed to work end-to-end with Claude Code — just paste the documentation into the terminal and have the integration generated.
How does session recovery work when the connection drops?
If the WebSocket drops, you have 30 seconds to reconnect. After re-establishing the connection, the agent continues exactly where the conversation left off, including all context, history, and settings. This is critical for production deployments, where network drops are routine rather than exceptional, and it significantly reduces user frustration during interrupted calls.
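Client-side, the recovery window translates into simple retry logic. The sketch below assumes a hypothetical `session_id` query parameter for resuming; the real resume mechanism may differ, but the 30-second retry structure stays the same.

```python
# Sketch of client-side reconnect logic for the 30-second recovery window.
# The "session_id" query parameter used to resume is an assumption.
import time
from websocket import WebSocketException, create_connection  # pip install websocket-client

def connect_with_recovery(base_url, api_key, session_id=None, deadline_s=30):
    """Reconnect to the agent within the recovery window, preserving context."""
    url = base_url if session_id is None else f"{base_url}?session_id={session_id}"
    start = time.monotonic()
    while True:
        try:
            return create_connection(url, header=[f"Authorization: {api_key}"])
        except (WebSocketException, OSError):
            if time.monotonic() - start > deadline_s:
                raise  # recovery window exceeded; server-side context is gone
            time.sleep(1)  # keep retrying inside the 30-second window
```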
Can I use Voice Agent API with my own LLM or TTS model?
Voice Agent API is a closed end-to-end pipeline built on AssemblyAI's own models — STT, LLM, and TTS are integrated under one price. If you need to use a specific external model (for example, Claude 4 or a custom TTS), AssemblyAI recommends using their Streaming Speech-to-Text API along with orchestrators like LiveKit or Pipecat. The article "When to use Voice Agent API vs. Universal-3 Pro Streaming" on the AssemblyAI blog will help with the decision.