Scientists shrink AI context up to 16× while keeping accuracy surprisingly high, paving the way for cheaper models

June 16, 2026 Daniel Cesak

When you ask ChatGPT, Claude, or Gemini to process a long document today, you pay for every token the model "sees". The longer the context, the higher the bill and the heavier the hardware load. A research team from Princeton, Columbia University, Harvard, the University of Maryland, NYU, and Lawrence Livermore National Laboratory has now presented a method that shrinks context by up to 16× while maintaining surprisingly high accuracy. Their open model, LCLM (Latent Context Language Model), compresses the input before the main language model starts processing it. The result is up to 8.8× faster inference and substantially lower operating costs.

Why the context window is AI's new bottleneck

Language models like GPT, Claude, or Gemini now routinely offer context windows of hundreds of thousands to millions of tokens. That sounds great: you can dump an entire book, several dozen documents, or a complete chat history into the model. But every token in the context consumes memory and compute. The longer an AI agent runs, the more tokens accumulate from loaded documents, reasoning traces, and conversation history.

According to the VB Pulse survey from Q1 2026, the share of companies planning hybrid retrieval (a combination of search and compression) tripled in two months — from 10.3% in January to 33.3% in March. Retrieval optimization became the number one priority for 28.9% of surveyed organizations. The problem is real and companies are starting to tackle it intensively.

What LCLM is and how it works

Existing context compression methods — primarily so-called KV cache compression — have a fundamental drawback: they must first load the entire context into memory and only then can they shrink it. This means memory savings are only partial and the compression process itself takes extra time.

LCLMs (Latent Context Language Models) work differently. They are built on an encoder-decoder architecture, where:

The encoder (0.6 billion parameters) compresses blocks of input tokens into shorter sequences of so-called latent embeddings
The decoder (4 billion parameters) processes these compressed embeddings instead of the original tokens

The key difference: compression happens before the decoder even begins to work. The higher the compression ratio, the less computation and memory the decoder consumes. On the RULER benchmark for long context, LCLM at 16× compression ran 8.8× faster than comparable KV cache methods.
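To make the shape of this pipeline concrete, here is a minimal sketch of block-wise compression. It is not the authors' encoder: plain averaging of each block stands in for the learned 0.6-billion-parameter model, and the function name `compress_blocks` and the toy embeddings are assumptions made for illustration. The only point is that the sequence the decoder would see gets shorter by the compression ratio.

```python
from typing import List

Vector = List[float]


def compress_blocks(embeddings: List[Vector], ratio: int) -> List[Vector]:
    """Toy stand-in for the encoder: average every `ratio` token vectors
    into one latent vector. A real LCLM encoder is a trained network."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    latents: List[Vector] = []
    for start in range(0, len(embeddings), ratio):
        block = embeddings[start:start + ratio]
        dim = len(block[0])
        # Element-wise mean over the vectors in this block
        latents.append([sum(vec[i] for vec in block) / len(block) for i in range(dim)])
    return latents


# 64 fake token embeddings of dimension 4 (the values are arbitrary)
tokens = [[float(t), 0.0, 1.0, float(t % 2)] for t in range(64)]

for ratio in (4, 8, 16):
    latents = compress_blocks(tokens, ratio)
    print(f"{ratio}x: {len(tokens)} token vectors -> {len(latents)} latent vectors")
```

With 64 input vectors, the decoder would receive 16, 8, or 4 latent vectors at 4×, 8×, and 16× compression, which is where the savings in compute and memory come from.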

Numbers worth paying attention to

The researchers trained the models on more than 350 billion tokens in three variants, with compression ratios of 4×, 8×, and 16×. The results on the RULER benchmark are clear:

No compression: 94.41% accuracy
4× compression: 91.76% accuracy — a drop of less than 3 percentage points while shrinking context to a quarter
16× compression: 75.06% accuracy, a drop of roughly 19 percentage points with 93.75% of the context removed
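The percentages above follow directly from the compression ratio and the quoted accuracies. A short check of the arithmetic, using only the figures from this article (the helper name `tokens_removed_pct` is our own):

```python
# RULER accuracy figures quoted above, keyed by compression ratio (1 = no compression)
ruler_accuracy = {1: 94.41, 4: 91.76, 16: 75.06}


def tokens_removed_pct(ratio: int) -> float:
    """Share of context positions eliminated at a given compression ratio."""
    return (1 - 1 / ratio) * 100


baseline = ruler_accuracy[1]
for ratio, accuracy in ruler_accuracy.items():
    drop = baseline - accuracy
    print(
        f"{ratio:>2}x: {tokens_removed_pct(ratio):.2f}% of context removed, "
        f"accuracy {accuracy:.2f}% (drop of {drop:.2f} points)"
    )
```

At 4× the context shrinks by 75% for a 2.65-point drop; at 16× it shrinks by 93.75% for a 19.35-point drop.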

For comparison: all tested KV cache methods achieved worse results at the same compression ratio. On GSM8K math tasks, LCLM even outperformed all other solutions regardless of compression level.

"These growing contexts consume memory and computing power and become a computational bottleneck for LLMs," Micah Goldblum, co-lead of the research from Columbia University, told VentureBeat. "Our goal was to train language models that can handle very long contexts efficiently and accurately."

What this means for companies and developers

The practical impact is significant. With the standard KV cache approach, inference over a million-token context does not fit in the memory of a single H200 GPU. LCLM at 16× compression stays within the memory limit even at that context length.
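A back-of-the-envelope estimate shows why a million tokens is a problem. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per KV head, per position). The decoder configuration (36 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical one chosen to be plausible for a 4-billion-parameter model; the article does not give the real figures. It also assumes that one latent position costs as much cache as one token position, and it ignores model weights and activations.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB (10^9 bytes) for a given sequence length."""
    # Factor 2: one key vector and one value vector per layer, head, and position
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


H200_MEMORY_GB = 141  # HBM capacity of a single NVIDIA H200

# Hypothetical decoder configuration, NOT taken from the paper
config = dict(layers=36, kv_heads=8, head_dim=128)

context_tokens = 1_000_000
full = kv_cache_gb(context_tokens, **config)
compressed = kv_cache_gb(context_tokens // 16, **config)

print(f"uncompressed: {full:.1f} GB (fits on one H200: {full < H200_MEMORY_GB})")
print(f"16x compressed: {compressed:.1f} GB (fits on one H200: {compressed < H200_MEMORY_GB})")
```

Under these assumptions the uncompressed cache alone comes to about 147 GB, more than the GPU holds, while the 16× compressed sequence needs about 9 GB.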

Goldblum describes the use case vividly: "Whenever you load documents and want to dump them into the model's context, simply run them through the LCLM compressor first." The model also supports selective decompression — an AI agent can first "browse" the compressed text and then unpack only the passages it actually needs. It works similarly to how a person first skims a text and then reads into the relevant parts.
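The skim-then-read pattern can be illustrated with a deliberately simple toy. Nothing here reflects LCLM's actual interface: a truncated text preview stands in for the latent embeddings, keyword overlap stands in for whatever relevance signal an agent would use, and the names `Block`, `make_blocks`, and `select_blocks` are invented for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    block_id: int
    preview: str    # stand-in for the compressed form the agent browses
    full_text: str  # original passage, unpacked only on demand


def make_blocks(passages: List[str], preview_words: int = 4) -> List[Block]:
    """Keep a short preview of each passage next to its full text."""
    return [
        Block(i, " ".join(p.split()[:preview_words]), p)
        for i, p in enumerate(passages)
    ]


def select_blocks(blocks: List[Block], query: str, top_k: int = 1) -> List[Block]:
    """Rank blocks by word overlap between the query and the preview only."""
    terms = set(query.lower().split())
    ranked = sorted(
        blocks,
        key=lambda b: len(terms & set(b.preview.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


passages = [
    "Invoice totals for March are listed here with tax",
    "Shipping policy covers returns within thirty days",
    "Warranty terms apply to hardware only",
]
blocks = make_blocks(passages)

# The agent browses the previews, then unpacks only the best match
for block in select_blocks(blocks, "shipping policy"):
    print(block.block_id, block.full_text)
```

Only the selected passage is expanded to full length; everything else stays in its short form, which is the saving the selective decompression feature is aiming for.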

For Czech companies and developers, this means potentially lower API call costs — fewer input tokens means a lower bill. And for anyone running an LLM locally (a trend also supported by the Czech AI Factory in Ostrava), it means the ability to process longer documents on the same hardware.

What LCLM can't do yet

The researchers are honest about their results. Goldblum admitted that compressing reasoning traces at runtime is not yet solved. "A naive approach of occasional compression during generation might work, but it hasn't been verified yet," he said.

This is an important limitation especially for agentic AI systems, which often run long reasoning chains. For these, LCLM will be useful mainly for working with external documents and data — not for condensing the agent's own thought processes.

Open approach and availability

LCLM models are fully open source and hosted on Hugging Face under the latent-context organization. The three variants (4×, 8×, and 16× compression) and the complete source code are on GitHub, and the research paper is on arXiv.

This means anyone — including Czech startups — can use the models for free, integrate them into their RAG pipelines, and customize them to their own needs. At a time when API call prices for commercial models are rising (recall the recent GPT-5.5 price hike), any token savings are welcome.

Context: the race for longer model memory

LCLM doesn't arrive in a vacuum. DeepSeek last December introduced its own compression model that shrinks text 10× by converting it into image form. Google this March published TurboQuant — algorithms for extreme model quantization. And startups like Multiverse Computing last year raised $215 million for technology that compresses LLMs by up to 95%.

The trend is clear: simply expanding context windows is hitting physical and economic limits. The future belongs to models that use available memory and compute more intelligently — and LCLM shows that academic research in this area can keep pace with commercial giants.

Does LCLM work with Czech?

LCLM is language-independent: it compresses tokens regardless of language. Since it was trained on more than 350 billion tokens from various sources, Czech should be represented in the training data. However, the researchers have not tested compression quality on Czech texts separately, so we recommend verifying it on your own data.

Can I use LCLM with ChatGPT or Claude?

LCLM is a standalone model that compresses input before you send it to another LLM. You can therefore use it as a "pre-filter" for any model — including commercial APIs from OpenAI, Anthropic, or Google. Integration, however, requires technical knowledge and modification of your RAG pipeline.

How large is the LCLM model and what will it run on?

The encoder has 0.6 billion parameters and the decoder 4 billion. The entire model is significantly smaller than typical commercial LLMs; for comparison, GPT-4o's parameter count is undisclosed but is estimated to be many times larger. LCLM should run even on consumer GPUs with sufficient VRAM (we recommend at least 16 GB), but for enterprise deployment the researchers tested on server GPUs like the H200.