
DeepSeek V4: Million-Token Context Is an Inference Systems Problem, Not a Model Problem

DeepSeek V4 is the latest model from Chinese AI company DeepSeek, featuring an extraordinary 1-million-token context window. While competitors showcase benchmark tables, V4's real story lies elsewhere: the architecture radically transforms inference requirements. Together AI, which runs the model on NVIDIA Blackwell GPUs, now explains in detail why deploying V4 is more of a systems problem than a model performance question.

What does DeepSeek V4 bring?

DeepSeek V4 Pro is a massive model with 1.6 trillion parameters, of which approximately 49 billion are active per forward pass thanks to its Mixture-of-Experts (MoE) architecture. That alone isn't unusual among top-tier models: competitors like Qwen3.5-397B-A17B or Llama 4 Maverick use similar principles.

What makes V4 exceptional is its hybrid attention architecture. DeepSeek combines three different mechanisms (see the sizing sketch after this list):

  • Compressed Sparse Attention (CSA) — compresses context with stride 4, where each entry summarizes 8 neighboring tokens. Queries select approximately 128 compressed entries, providing a fine-grained sparse path into selected regions of the million-token prefix.
  • Heavily Compressed Attention (HCA) — same principle with stride 128. At 1M tokens, the cache shrinks from 1 million positions to just 8,000 compressed entries. The model gains a coarse global overview of the entire context at once.
  • Sliding Window Attention (SWA) — preserves an exact local path for a short window (128 tokens), ensuring the model doesn't lose detailed awareness of recent context.
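
To get a feel for how these three paths divide a million-token prefix, here is a back-of-the-envelope sizing sketch based on the strides quoted above. The function and its defaults are illustrative assumptions, not DeepSeek's actual implementation:

```python
# Rough cache sizing for V4's three attention paths, using the strides
# quoted in the article. Illustrative only; real layouts are richer.

def hybrid_cache_entries(context_len: int,
                         csa_stride: int = 4,
                         hca_stride: int = 128,
                         swa_window: int = 128) -> dict:
    """Approximate number of KV entries each attention path keeps."""
    return {
        "csa": context_len // csa_stride,     # fine-grained compressed entries
        "hca": context_len // hca_stride,     # coarse global entries
        "swa": min(context_len, swa_window),  # exact local window
    }

print(hybrid_cache_entries(1_000_000))
# {'csa': 250000, 'hca': 7812, 'swa': 128} vs. 1,000,000 full-attention entries
```

Even before any dtype or head tricks, the token axis alone collapses a million positions into a few hundred thousand compressed entries plus a tiny exact window.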

This combination shrinks the KV cache to roughly 10% of the previous DeepSeek V3.2's, and inference FLOPs to 27%. The model was pre-trained on 32 trillion tokens using the Muon optimizer, followed by a two-stage post-training pipeline.

Benchmarks: Where DeepSeek V4 excels

According to Together AI data, DeepSeek V4 Pro achieves impressive results:

  • 93.5% on LiveCodeBench — excellent coding performance approaching specialized models like GPT-OSS or Claude
  • 90.1% on GPQA Diamond — demanding scientific and professional reasoning benchmark
  • 80.6% on SWE-Bench Verified — real-world software engineering, bug fixing in code

For comparison, the previous-generation DeepSeek V3.1 scores significantly lower on the agentic tasks where V4 excels thanks to its combination of long context and hybrid attention. Anthropic's Claude 4 Sonnet scores around 72–75% on SWE-Bench, so DeepSeek V4 Pro at 80.6% raises the bar significantly for open-source models.

The KV cache problem

The key engineering insight from the Together AI article is that DeepSeek V4 attacks the KV cache problem from a new direction. Earlier techniques — Group Query Attention (fewer KV heads), Multi-Head Latent Attention (compression into a latent representation), FP8/MXFP4 (smaller data types per element) — all addressed different terms of the same equation. DeepSeek V4 goes after the token axis itself: compressing context before it is stored in the KV cache.
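
To make "different terms of the same equation" concrete: the KV footprint factors into a product of independent dimensions, and each technique shrinks a different factor. The dimensions below are illustrative placeholders, not V4's actual configuration:

```python
# KV cache size factors into independent terms; each technique attacks
# a different one. All dimensions here are illustrative assumptions.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes):
    return 2 * tokens * layers * kv_heads * head_dim * dtype_bytes  # K and V

full = kv_cache_bytes(1_000_000, 61, 128, 128, 2)       # dense MHA, FP16
gqa  = kv_cache_bytes(1_000_000, 61, 8, 128, 2)         # GQA: fewer KV heads
fp8  = kv_cache_bytes(1_000_000, 61, 8, 128, 1)         # FP8: smaller elements
tok  = kv_cache_bytes(1_000_000 // 128, 61, 8, 128, 1)  # token axis (HCA-style)

for name, size in [("full", full), ("gqa", gqa), ("fp8", fp8), ("token", tok)]:
    print(f"{name:>5}: {size / 2**30:,.1f} GiB")
```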

On the surface, this sounds like a clear victory. Reality is more complex. During initial deployment on NVIDIA HGX B200, Together AI engineers discovered that serving capacity wasn't limited by compressed cache (CSA/HCA) but by how the engine handled local SWA cache. A full SWA implementation actually had a higher per-token footprint than the previous generation V3 — approximately 3.8 KB per token versus 3.4 KB.

The real gain came from cache policy: by keeping only the SWA states most likely to be reused, they increased total KV cache capacity on a single B200 node from approximately 1.2M tokens to 3.7M tokens. That is the main lesson: V4's architecture creates the opportunity for long-context efficiency, but the realized capacity depends on how the inference engine stores, recomputes, and evicts different cache types.
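
Together AI has not published its exact policy, but the shape of the idea is easy to sketch: compressed CSA/HCA state is always retained, while exact SWA windows live in a bounded pool with recency-based eviction. Everything below, names included, is a hypothetical illustration, not their implementation:

```python
# Hypothetical sketch of "keep only the SWA states most likely to be
# reused": exact SWA windows sit in a small LRU pool and are recomputed
# on a miss, while compressed CSA/HCA state is always kept elsewhere.

from collections import OrderedDict

class SwaCachePool:
    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.pool = OrderedDict()  # prefix_id -> exact SWA window state

    def get(self, prefix_id):
        if prefix_id in self.pool:
            self.pool.move_to_end(prefix_id)  # mark as recently used
            return self.pool[prefix_id]
        return None  # miss: caller recomputes ~window_size * layers tokens

    def put(self, prefix_id, swa_state):
        self.pool[prefix_id] = swa_state
        self.pool.move_to_end(prefix_id)
        if len(self.pool) > self.max_entries:
            self.pool.popitem(last=False)  # evict the least recently used
```

Capping the exact-state pool is what frees memory for more compressed entries; the reported jump from 1.2M to 3.7M tokens of effective capacity is attributed to this kind of rebalancing between exact and compressed state.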

Implications for developers and businesses

For European AI teams considering DeepSeek V4 deployment, the key insight is that the same weights need different serving profiles depending on the workload. Together AI identifies several regimes (sketched in code after the list):

  • Long context, agents, and coding assistants — this is where V4 shines. These workloads read massive caches during decode, so compressed KV cache, batching, and prefix reuse pay off most. For agentic tasks like coding over entire repositories or research assistants, the model shifts the cost model from "price per token" to "cost per completed task".
  • Short conversations and chat — V4 doesn't excel here yet. Short contexts don't benefit from compressed cache advantages and instead suffer from less mature kernel paths for CSA/HCA. For standard chatbot use, established models like Claude, GPT, or DeepSeek V3.1 remain better choices.
  • RL rollouts — reinforcement learning and long trajectories have their own optimization targets: cost per long trajectory, not single-query latency.
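
What those profiles might look like as configuration is sketched below. The field names and values are hypothetical illustrations, not a real Together AI or inference-engine API:

```python
# Hypothetical serving profiles for the same V4 weights. Keys and values
# are illustrative only; they do not correspond to any real engine config.

PROFILES = {
    "long_context_agent": {
        "max_context_tokens": 1_000_000,
        "prefix_cache": "persistent",      # agents re-read big shared prefixes
        "swa_policy": "store_full",
        "batching": "throughput_first",
    },
    "chat": {
        "max_context_tokens": 16_384,
        "prefix_cache": "best_effort",     # short prompts, little overlap
        "swa_policy": "recompute_on_hit",
        "batching": "latency_first",
    },
    "rl_rollout": {
        "max_context_tokens": 256_000,
        "prefix_cache": "per_trajectory",  # optimize cost per long trajectory
        "swa_policy": "checkpoint",
        "batching": "throughput_first",
    },
}
```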

The model is available through the Together AI API at $2.10 per million input tokens ($0.20 on a cache hit) and $4.40 per million output tokens. That is pricier than DeepSeek V3.1 ($0.60/$1.70), but competitive for a model of this size given its significantly stronger long-context capabilities. The model is distributed under the MIT license, fully open and usable in commercial projects.
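
A quick worked example shows what the cache-hit discount means for the "cost per completed task" framing. The workload here is hypothetical: an agent that re-reads a 400K-token repository prefix across ten calls, with the prefix hitting the cache on nine of them:

```python
# Worked cost example at the quoted Together AI prices. The workload
# shape (400K shared prefix, 10 calls, 2K fresh input, 1K output per
# call) is a hypothetical assumption for illustration.

PRICE_IN, PRICE_IN_CACHED, PRICE_OUT = 2.10, 0.20, 4.40  # USD per 1M tokens

def call_cost(fresh_in, cached_in, out):
    return (fresh_in * PRICE_IN + cached_in * PRICE_IN_CACHED
            + out * PRICE_OUT) / 1_000_000

first = call_cost(402_000, 0, 1_000)          # prefix paid at the full rate
rest  = 9 * call_cost(2_000, 400_000, 1_000)  # prefix hits the cache
print(f"10-call task: ${first + rest:.2f}")   # ~$1.65, vs ~$8.49 uncached
```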

Prefix caching as a storage policy

One of the most interesting V4 innovations is its prefix caching approach. In standard models, the rule is simple: shared prefix = shared KV cache. With V4, the question becomes: which cache?

A shared prefix contains CSA state, HCA state, SWA state, and uncompressed tails used by compressors. CSA and HCA are compact and easy to store. SWA is exact local state that becomes expensive to store, especially when the cache moves beyond GPU memory.

The DeepSeek technical paper describes three strategies:

  1. Store full SWA cache — simple but grows proportionally with context length
  2. Store periodic SWA checkpoints — saves state every K tokens, recomputes the gap on cache hit
  3. Recompute SWA on hit — stores only CSA/HCA and rebuilds SWA on prefix reuse. Cost is bounded by window size times layer count: at 128 tokens and 61 layers, roughly 8K tokens of recompute — negligible against a 1M prefix

Together AI uses the first strategy (full store) in its initial deployment — a pragmatic decision that keeps prefix reuse straightforward while the rest of the serving stack matures.
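
To make the trade-off between the three strategies concrete, here is a rough cost model using the figures quoted above (3.8 KB of SWA state per token, a 128-token window, 61 layers). The checkpoint interval is an arbitrary assumption:

```python
# Rough storage/recompute trade-off for the three SWA strategies.
# PER_TOKEN_KB, WINDOW, and LAYERS come from the figures in the article;
# the checkpoint interval k is an arbitrary assumption.

PER_TOKEN_KB, WINDOW, LAYERS = 3.8, 128, 61

def swa_cost(prefix_len, strategy, k=4096):
    if strategy == "store_full":    # exact state grows with the prefix
        return {"stored_kb": prefix_len * PER_TOKEN_KB, "recompute_tokens": 0}
    if strategy == "checkpoint":    # a window snapshot every k tokens
        return {"stored_kb": (prefix_len // k) * WINDOW * PER_TOKEN_KB,
                "recompute_tokens": k // 2}  # expected gap replayed on a hit
    if strategy == "recompute":     # store nothing, rebuild the window
        return {"stored_kb": 0, "recompute_tokens": WINDOW * LAYERS}  # ~7.8K

print(swa_cost(1_000_000, "store_full"))  # ~3.8 GB of exact SWA state
print(swa_cost(1_000_000, "recompute"))   # 7808 tokens of recompute
```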

What to benchmark before deploying V4

If you're considering migration to DeepSeek V4, Together AI recommends benchmarking four areas:

  • Context regime — how long are your actual contexts? Below 100K tokens, V4's advantages won't fully materialize.
  • Prefix reuse — how often do queries overlap? Critical for agentic tasks with repeating context; see the measurement sketch after this list.
  • Cache policy — store vs recompute SWA on prefix hit. Depends on prefix length and reuse frequency.
  • Endpoint profile — same weights, different profile. Long-context agents need larger tensor-parallel groups, batching, and different eviction policies than chat workloads.
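
For the prefix-reuse question in particular, a rough way to measure your own traffic is to tokenize consecutive requests and check how much of each one is covered by its predecessor's prefix. The helper below is a hypothetical sketch; plug in your own tokenizer and request log:

```python
# Hypothetical sketch for estimating prefix reuse across a request log.
# Requests are token-ID lists; adapt to your tokenizer and traffic.

def common_prefix_len(a: list, b: list) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reuse_ratio(requests: list) -> float:
    """Fraction of input tokens covered by the previous request's prefix."""
    shared = sum(common_prefix_len(p, q)
                 for p, q in zip(requests, requests[1:]))
    total = sum(len(r) for r in requests[1:])
    return shared / total if total else 0.0

# Toy example: three calls sharing a long system+repository prefix
reqs = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 9], [1, 2, 3, 7, 8]]
print(f"{reuse_ratio(reqs):.0%} of tokens reusable")  # 70%
```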

Summary

DeepSeek V4 is not just another model with better benchmarks. It represents an architectural shift toward models designed from the ground up for efficient production deployment. The million-token context stops being a marketing number and becomes practically usable — but only when the inference engine knows how to handle the new architecture properly.

For European developers and companies, DeepSeek V4 is available via the Together AI API at competitive prices and under the MIT license. If you're working on projects requiring analysis of extensive documents, coding over entire repositories, or running AI agents, DeepSeek V4 deserves your attention — with the understanding that its real performance will only show in your specific workload and serving configuration.

Can DeepSeek V4 be deployed locally, or is it cloud-only?

DeepSeek V4 is an MIT-licensed open-source model, so local deployment is theoretically possible. Practically, it's extremely demanding — the model has 1.6 trillion parameters and requires cutting-edge hardware (at minimum several NVIDIA Blackwell GPUs in a cluster). For most teams, the API approach via Together AI or other providers is the only realistic option.

Does DeepSeek V4 have EU data protection compliance?

As an open-source model under MIT license, DeepSeek V4 can be deployed on infrastructure that complies with GDPR and other EU regulations. When using the Together AI API, data processing follows Together AI's terms — check whether the provider offers EU-based processing regions for GDPR compliance. The model's open nature gives European companies flexibility to choose their deployment strategy according to regulatory requirements.

Does it make sense to switch from DeepSeek V3.1 to V4 for everyday tasks?

For regular chat, translation, or short texts — no. DeepSeek V3.1 is comparably capable in these tasks and cheaper ($0.60/$1.70 vs $2.10/$4.40 per million tokens). V4 makes sense where you need to analyze extensive documents (hundreds of thousands of tokens), code over entire repositories, or run complex AI agents. For short contexts, V4's advantages are minimal while output token pricing is significantly higher.
