Why inference speed is critical
Large language models are ubiquitous today. They help with programming, writing, data analysis, and controlling autonomous agents. The problem arises when we want to run them locally, on a personal computer, laptop, or mobile phone. Standard inference is bound by memory bandwidth, not by raw computing power. This means that the processor or graphics card spends most of its time moving billions of parameters from memory to compute units just to generate a single token (one word or part of one).
This phenomenon is especially noticeable on common hardware. Even powerful chips, whether an NVIDIA RTX PRO 6000 workstation card or a consumer Apple Silicon processor, often reach only a fraction of their theoretical performance because they are limited by data transfer. The result? High latency, slow responses, and frustrated users who expect an interactive experience.
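For intuition, a back-of-the-envelope calculation shows why bandwidth sets the ceiling. The figures below are illustrative assumptions (a dense 26B model in 16-bit precision and roughly 1 TB/s of memory bandwidth), not measurements of any specific device:

```python
# Back-of-the-envelope ceiling for memory-bandwidth-bound decoding.
# All numbers are illustrative assumptions, not official specifications.

params = 26e9          # parameters in a (dense) 26B model
bytes_per_param = 2    # 16-bit weights (bf16/fp16)
bandwidth = 1.0e12     # assumed memory bandwidth: ~1 TB/s

# Generating one token requires streaming roughly all weights once.
bytes_per_token = params * bytes_per_param
ceiling_tokens_per_s = bandwidth / bytes_per_token

print(f"Upper bound: {ceiling_tokens_per_s:.1f} tokens/s")  # ~19.2
```

No matter how fast the compute units are, such a setup cannot exceed roughly 19 tokens per second when generating one token at a time. (A mixture-of-experts model activates only a subset of its weights per token, which raises this ceiling but does not remove it.) This is exactly the limit that speculative decoding is designed to lift.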
How speculative decoding works
The technique that Google is now implementing for Gemma 4 is not entirely new. Google researchers first introduced it back in 2022 in the paper "Fast Inference from Transformers via Speculative Decoding". The principle is surprisingly simple: instead of the main model generating token by token, we add a lightweight "drafter" model that predicts several future tokens at once.
Imagine it as an author and a proofreader. The author (the drafter) quickly and instinctively writes a passage ahead. The proofreader (the main model) then checks the whole draft in a single pass and accepts every token up to the first one it disagrees with. When the draft holds up, we get several tokens "for free" in roughly the time the main model would normally need to generate one. And when the drafter misses, the main model simply supplies the correct token and continues. As a result, there is no degradation of quality in the output.
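In code, the draft-and-verify loop looks roughly like the sketch below. This is a minimal greedy variant for clarity (the original paper uses a rejection-sampling rule so that sampled outputs also match the main model's distribution), and `greedy_generate` and `greedy_verify` are hypothetical interfaces standing in for real model calls:

```python
def speculative_decode(main_model, draft_model, prompt_tokens, k=4, max_new=128):
    """Greedy speculative decoding sketch.

    draft_model.greedy_generate(tokens, n) and
    main_model.greedy_verify(tokens, draft) are hypothetical interfaces:
    the first returns k drafted tokens, the second returns the main
    model's own greedy choice at each of the k + 1 positions, computed
    in a single forward pass.
    """
    tokens = list(prompt_tokens)
    target_len = len(tokens) + max_new
    while len(tokens) < target_len:
        # 1. The cheap drafter proposes k candidate tokens.
        draft = draft_model.greedy_generate(tokens, n=k)

        # 2. One main-model pass scores all draft positions at once;
        #    this costs about as much as generating a single token.
        verified = main_model.greedy_verify(tokens, draft)

        # 3. Keep drafted tokens until the first disagreement, then
        #    append the main model's own token (its correction, or a
        #    "bonus" token when the whole draft was accepted). The
        #    result is identical to the main model decoding alone.
        n = 0
        while n < k and draft[n] == verified[n]:
            n += 1
        tokens += draft[:n] + [verified[n]]
    return tokens[:target_len]
```

On a good draft, each loop iteration yields up to k + 1 tokens for one expensive forward pass; in the worst case it still yields one, so the loop never falls behind ordinary decoding by more than the drafter's (small) overhead.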
Google has taken this technique further with the Gemma 4 models. The MTP drafters share KV cache and internal activations with the main model, meaning they do not have to recompute context that the large model has already processed. For the E2B and E4B edge models, the developers also implemented efficient clustering in the embedder, which speeds up the final logit computation — a typical bottleneck in smaller models.
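The announcement does not include reference code, but the idea of reusing the main model's work can be illustrated with a small sketch. The class below is a hypothetical drafter head in the spirit of multi-token prediction (similar in flavor to Medusa- or EAGLE-style drafting, not Google's actual implementation): it drafts several upcoming tokens directly from the main model's final hidden state, so the long context is processed, and its KV cache built, only once:

```python
import torch
import torch.nn as nn

class HypotheticalMTPHead(nn.Module):
    """Illustrative multi-token-prediction drafter head (not Google's
    implementation): drafts the next k tokens straight from the main
    model's last hidden state instead of re-reading the context."""

    def __init__(self, hidden_size: int, vocab_size: int, k: int = 4):
        super().__init__()
        # One lightweight projection per drafted position.
        self.steps = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(k)
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: [batch, hidden_size], taken from the main model.
        h, logits = last_hidden, []
        for step in self.steps:
            h = torch.relu(step(h))          # cheap per-position update
            logits.append(self.lm_head(h))   # draft logits, one position
        return torch.stack(logits, dim=1)    # [batch, k, vocab_size]
```

The point of the sketch is the cost profile: the drafter adds a handful of matrix multiplications per token instead of a second full pass over the context.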
Specific numbers and supported platforms
According to official Google measurements, MTP drafters for Gemma 4 deliver up to 3× acceleration compared to standard inference. Tests were conducted on the LiteRT-LM, MLX, Hugging Face Transformers, and vLLM platforms. Specifically, the Gemma 4 26B model on an NVIDIA RTX PRO 6000 graphics card achieves approximately double the tokens per second with the drafter while maintaining identical output quality.
The results on Apple Silicon are also interesting. With the 26B mixture-of-experts (MoE) model, expert routing limits the gains at a batch size of 1, but when processing multiple requests simultaneously (batch sizes 4–8), speed increases by approximately 2.2×. Developers report similar acceleration on server cards such as the NVIDIA A100.
Comparison with competition and market context
Speculative decoding is gradually becoming standard. Similar techniques are used by DeepSeek models and by implementations within Meta's Llama ecosystem. However, Google now offers a native, officially supported, and optimized variant directly for its flagship open-source family. This is important especially for developers who prefer proven solutions from major players with long-term support.
Gemma 4 was introduced only a few weeks ago and has already recorded over 60 million downloads. The model is designed to offer maximum intelligence per parameter — from mobile devices through workstations to the cloud. With the arrival of MTP drafters, this promise is pushed even further. While the competition often emphasizes pure benchmark performance, Google focuses on practical usability in real-world conditions.
What this means for Czech developers and companies
For the Czech market, the key point is that Gemma 4 and its MTP drafters are available under the Apache 2.0 license. This means that developers, startups, and large companies can use them commercially without licensing fees and without fear of vendor lock-in. In the context of the upcoming implementation of the EU AI Act, running open-source models locally is also advantageous from a compliance perspective — data does not leave corporate infrastructure and organizations have full control over how the model processes sensitive information.
Availability in popular tools such as Ollama, vLLM, SGLang, and MLX means that Czech developers do not need to change their existing infrastructure. They simply download the model weights from Hugging Face or Kaggle and integrate them into existing pipelines. For mobile applications, the models can be tried directly in the Google AI Edge Gallery app for Android or iOS.
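As a concrete illustration, Hugging Face Transformers already exposes speculative decoding as "assisted generation" via the `assistant_model` argument of `generate`. The checkpoint names below are placeholders, so substitute the actual Gemma 4 and drafter repositories from the model cards:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint names; substitute the real Gemma 4 repositories.
MAIN_ID = "google/gemma-4-26b-it"
DRAFT_ID = "google/gemma-4-26b-it-mtp-drafter"

tokenizer = AutoTokenizer.from_pretrained(MAIN_ID)
model = AutoModelForCausalLM.from_pretrained(MAIN_ID, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(DRAFT_ID, device_map="auto")

inputs = tokenizer("Summarize speculative decoding.", return_tensors="pt")
inputs = inputs.to(model.device)

# Passing assistant_model switches generate() into assisted decoding:
# the drafter proposes tokens and the main model verifies them.
outputs = model.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Assuming the drafter shares the main model's tokenizer, which same-family drafters typically do, nothing else in an existing pipeline needs to change.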
Practical use and the future
MTP drafters are not just an academic curiosity. Their impact will be visible in several areas:
Programming assistants: Faster code generation means a smoother workflow. The developer receives suggestions almost instantly, which increases productivity and reduces concentration interruptions.
Voice agents and chatbots: Latency is key to natural conversation. With MTP drafters, we are approaching real time even on common hardware.
Autonomous agents: Agents requiring fast multi-step planning benefit from every saved millisecond. Faster inference enables more complex tasks in less time.
Edge and mobile devices: Shorter generation time directly saves battery. Smartphone and tablet users can thus use advanced AI offline and for longer.
Google has also published a technical breakdown that explains in detail the architecture of the drafters, KV cache sharing, and embedder optimization. For developers who want to understand the details, it is a valuable resource.
Do I need special hardware to use MTP drafters?
No. MTP drafters work on common hardware, from Apple Silicon through consumer NVIDIA cards to A100 server GPUs. Optimization varies by platform, but the principle is the same everywhere.
Is any response quality lost when using the drafter?
No. The main Gemma 4 model always has the final say — every token generated by the drafter is verified. If the drafter misses, the main model corrects the error. The resulting quality is identical to standard inference.
How do I get started with MTP drafters in Ollama or vLLM?
Simply download the MTP drafter weights for your Gemma 4 model from Hugging Face and follow the official documentation at ai.google.dev/gemma/docs/mtp. Support is integrated directly into popular frameworks, so activation usually requires only a modification of the inference configuration.
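For example, a vLLM setup might look like the sketch below. The model names are placeholders, and the speculative decoding arguments have changed between vLLM releases, so treat this as an outline and confirm the exact options against the documentation for your installed version:

```python
from vllm import LLM, SamplingParams

# Placeholder model names; the speculative decoding arguments differ
# between vLLM releases, so verify them for your installed version.
llm = LLM(
    model="google/gemma-4-26b-it",
    speculative_config={
        "model": "google/gemma-4-26b-it-mtp-drafter",
        "num_speculative_tokens": 4,
    },
)

params = SamplingParams(temperature=0.7, max_tokens=128)
result = llm.generate(["Explain speculative decoding briefly."], params)
print(result[0].outputs[0].text)
```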