Why aren't GPUs enough?
Graphics cards have dominated AI infrastructure because they excel at massively parallel mathematical operations. However, modern AI systems, especially agentic ones, involve two distinct processing phases with very different hardware requirements.
The first phase, called prefill, reads the entire input prompt, processes the context, and builds a so-called KV cache (key-value cache). This work is highly parallel, and GPUs handle it excellently. The second phase, decode, generates output tokens one by one. It is sequential and critically dependent on memory bandwidth and latency, and this is where GPUs hit their limits.
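The split between the two phases can be sketched in a few lines. This is a toy model, not any real framework API: `attend`, `prefill`, and `decode` are illustrative stand-ins, but they show why prefill can batch all prompt tokens at once while decode must step one token at a time, re-reading the growing cache at every step.

```python
# Illustrative sketch of the two inference phases; the "model" here is a
# stand-in function, not SambaNova's or any framework's real API.

def attend(query, kv_cache):
    # Toy attention: each new token reads EVERY cached entry, which is
    # why the decode phase is bound by memory bandwidth, not compute.
    return sum(kv_cache) + query

def prefill(prompt_tokens):
    # Phase 1: the whole prompt is processed in one parallel pass
    # (GPU-friendly), producing the KV cache.
    return [t * 2 for t in prompt_tokens]

def decode(kv_cache, steps):
    # Phase 2: tokens are generated strictly one after another; each
    # step depends on the previous one, so it cannot be parallelized.
    out = []
    token = attend(0, kv_cache)
    for _ in range(steps):
        kv_cache.append(token)                # the cache grows every step
        token = attend(token, kv_cache) % 97  # toy "next token" rule
        out.append(token)
    return out

cache = prefill([1, 2, 3])
print(decode(cache, 4))
```

The point of the sketch is the shape of the loops: `prefill` is one batched pass, while `decode` is an inherently serial chain in which every step must stream the entire cache (and, in a real model, the weights) through memory.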
“GPUs are very good at parallelizing matrix math for input processing. They are not good for decoding, especially for latency-sensitive workloads,” says McGonnell, VP of Product at SambaNova. This problem is significantly exacerbated in agentic AI, where the model not only generates text but continuously calls external tools, APIs, databases, or executes code — all while waiting for responses.
Triple Architecture: Each Chip Does What It Excels At
The solution proposed by SambaNova and Intel is heterogeneous inference — dividing tasks among three types of processors according to their natural strengths:
GPUs for Prefill
Graphics cards remain in place for the first phase — processing the input prompt. They are indispensable here: their massively parallel architecture allows for rapid processing of even very long contexts and the creation of a KV cache, which is then passed to other layers of the system.
SambaNova RDU for Decode
The new fifth-generation SN50 RDU (Reconfigurable Dataflow Unit) chip takes over the most demanding part of inference: sequential token generation. SambaNova claims that, compared to Nvidia Blackwell B200, it achieves:
- 5× higher peak speed when working with the Llama 3.3 70B model
- 3× higher throughput for agentic inference workloads
- 8× lower costs for inference when working with the GPT-OSS-120B model
The chip utilizes a three-layer memory architecture combining high-volume memory, high-performance HBM, and ultra-fast on-chip SRAM. A key advantage is the Dataflow architecture, which maps model execution directly onto the processor — significantly reducing unnecessary data transfers, which are the main "energy hog" for GPUs.
One SambaRack SN50 rack contains 16 SN50 chips, and the system can scale up to 256 interconnected accelerators. Average consumption is 20 kW per rack in air-cooled data centers. The chip supports models with up to 10 trillion parameters and context windows up to 10 million tokens.
Intel Xeon 6 for Agentic Coordination
Intel Xeon 6 processors serve as the "brain" of the entire system. Their role is not raw AI compute; they handle the orchestration of agentic tasks: calling tools and APIs, distributing workloads, compiling and executing code, validating results, and managing overall system behavior. This layer is crucial for coding agents and other autonomous AI systems where the model continuously switches between generating text and initiating real-world actions.
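The division of labor described above can be sketched as a simple orchestration loop. Everything here is hypothetical for illustration: `fake_model` stands in for accelerator-side inference, and the `lookup` tool stands in for an API or database call executed on the CPU side.

```python
# Hedged sketch of an agentic loop: the CPU-side orchestrator alternates
# between asking the model for a step and executing that step locally.
# `fake_model` and the tool names are invented for illustration.

def fake_model(history):
    # Pretend inference: choose the next action from the last observation.
    if history and history[-1] == "result:42":
        return ("finish", "The answer is 42")
    return ("call_tool", "lookup", "answer")

TOOLS = {"lookup": lambda query: "result:42"}  # stand-in for an API/DB call

def run_agent(max_steps=5):
    history = []
    for _ in range(max_steps):
        action = fake_model(history)      # decode phase (accelerator side)
        if action[0] == "finish":
            return action[1]
        _, tool, arg = action
        history.append(TOOLS[tool](arg))  # tool execution (CPU side)
    return "step budget exhausted"

print(run_agent())
```

Each pass through the loop is exactly the pattern the article describes: a short burst of token generation, then a real-world action, then more generation with the result appended to context.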
Intel bets on the stability of the x86 ecosystem — decades of software, tools, and knowledge available to developers. The focus is shifting from maximum performance on paper towards system-wide efficiency and cost per processed workload.
Why is this being discussed now?
The reason is simple: agentic AI is growing rapidly. Systems like coding agents or multi-step AI assistants for enterprises require continuous inference loops — the model generates steps, runs a tool, reads the result, generates further steps. This mode of operation dramatically strains GPUs, which were not originally designed for this pattern.
At the same time, businesses report rising inference costs, capacity issues, and underutilization of parts of their GPU infrastructure, because GPUs spend time waiting instead of working. Patrick Moorhead, CEO of the analyst firm Moor Insights & Strategy, summarized it aptly: "We have reached a point where heterogeneous computing is the right path."
Availability and Relevance for Czech Companies
The complete solution — a combination of GPU, SN50 RDU, and Xeon 6 in a single infrastructure — will be available in the second half of 2026. It targets large enterprises, cloud providers, and so-called sovereign AI programs, i.e., state or national AI initiatives that want to run agentic workloads on their own infrastructure.
For Czech and European companies, the perspective on sovereign AI is particularly relevant: EU countries, including the Czech Republic, are investing in their own AI capacity to avoid complete dependence on American cloud platforms. The architecture proposed by SambaNova and Intel could be a suitable foundation for such deployments — especially where sensitive data processing is required without moving to a foreign cloud.
Direct availability of SambaNova products in the Czech Republic is not yet confirmed; the company operates primarily through enterprise channels and cloud partners. Intel Xeon 6 processors, by contrast, are broadly available on the global market, including in the Czech Republic.
End of the "one chip for everything" era?
The partnership between SambaNova and Intel is part of a broader trend: AI hardware is ceasing to be monolithic. Similar to the history of the server market, where specialized chips emerged for networking, data storage, or encryption, AI inference is also dividing into specialized components. GPUs will remain strong for training and prefill, but specialized architectures will increasingly compete for decode and agentic orchestration.
Nvidia currently holds a dominant position — and that's why numbers like "5× faster than B200" or "8× lower costs" are so interesting. If SambaNova delivers on these numbers in production conditions (not just in lab tests), the balance of power in AI data centers could truly begin to shift.
What is an RDU chip and how does it differ from a GPU?
RDU (Reconfigurable Dataflow Unit) is a specialized AI accelerator developed by SambaNova. Unlike GPUs, which are designed for massive parallel operations (such as training or the prefill phase of inference), RDU is optimized for sequential token generation (the decode phase) with low latency and high memory efficiency. The key is the Dataflow architecture, which maps computations directly onto the chip and minimizes unnecessary data transfers.
What does "agentic AI" mean and why does it need different hardware?
Agentic AI refers to systems that not only answer questions but autonomously plan and execute actions — running code, calling APIs, searching databases, compiling results, and repeating this cycle. Unlike simple chatbots, they require constant switching between text generation and real-world operations. This pattern burdens GPUs inefficiently — which is why SambaNova and Intel propose specializing each layer for a different type of processor.
When will this solution be available and for whom is it intended?
Production availability is planned for the second half of 2026. The solution primarily targets large enterprises, cloud providers, and national (sovereign) AI programs. For smaller companies or individual developers, cloud services built on this infrastructure will be more relevant than direct hardware purchase.