
Digital Hunger: Why AI Models Are Being "Milked" and What It Means for the Future of the Internet

State and private research teams are warning: the rapid growth of large language models (LLMs) is hitting physical and logical boundaries. The internet, which once served as an endless source of information for AI training, is beginning to run dry. If development companies don't pivot to new data sources, stagnation or the phenomenon known as "model collapse" looms on the horizon.

In recent years, we have seen a steady increase in the parameter counts and capabilities of models like GPT-4, Claude 3.5, or Gemini. All these systems learn from vast amounts of text available on the internet. However, as an analysis on CDR.cz points out, this source is not infinite. AI systems are "reading" the internet so fast that they are approaching the point where no new, high-quality text will be left for them.

Digital hunger: Why are AI models starting to "run dry"?

To understand the problem, it's essential to grasp how LLM training works. Models are not just programs; they are statistical representations of human knowledge extracted from text data. The more quality data a model processes, the better it understands language nuances and logic. The problem arises when we hit what's called the "Data Wall".

Most publicly available foundational texts, from Wikipedia through scholarly articles to digital libraries, have already been absorbed into the training sets of the biggest players. If models start learning only from text generated by other AI systems (so-called synthetic data), a serious danger emerges: quality degrades from one generation to the next, as each model repeats the mistakes of its predecessors and loses the ability to generate creative or unique responses. Experts call this phenomenon model collapse.
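The mechanism can be illustrated with a deliberately simplified toy simulation. It is only a sketch and not how real LLMs are trained: a bell curve (mean and spread) stands in for the "model", and each generation is fitted only to a small sample drawn from the previous generation. The function name `simulate_collapse` and all parameter values are assumptions made for this illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=20, seed=0):
    """Toy model: each generation learns only from its predecessor's output.

    Returns the spread (standard deviation) of the fitted "model" after
    every generation, starting with the original human-data spread of 1.0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0          # generation 0: stands in for human-written data
    history = [std]
    for _ in range(generations):
        # The previous model generates a small synthetic training set...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...and the next model is fitted to that synthetic data alone.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    spreads = simulate_collapse()
    print(f"spread at generation 0:   {spreads[0]:.4f}")
    print(f"spread at generation {len(spreads) - 1}: {spreads[-1]:.6f}")
```

In this toy setting the spread tends to shrink toward zero over many generations, because every refit loses a little of the variety present in the data it was trained on. That loss of variety is the intuition behind the "model collapse" described above.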

Comparison of market leaders in the context of data demands

The current top players in AI are trying to tackle this problem through various methods:

  • OpenAI (GPT series): Strives for closed ecosystems and partnerships with publishers (e.g., Axel Springer) to gain exclusive access to quality data.
  • Google (Gemini): Holds a massive advantage through its control of YouTube and the search index, which provide unique multimedia data (video, audio) that is crucial for training the next generation of models.
  • Anthropic (Claude): Emphasizes safety and "constitutional AI," which requires extremely precisely filtered and high-quality data, not just quantity.

From a pricing standpoint, the situation is stable for users but challenging for developers. While ChatGPT Plus costs about 20 USD (around 470 CZK) per month, the cost of training new models is rising steeply, which could mean that advanced models will become increasingly expensive for everyday users in the Czech Republic.

Infrastructure under pressure: Not just data, but energy too

The shortage isn't just about "content," but also physical capacity. Training and subsequent operation (inference) of models require enormous amounts of computing power. This leads to extreme demand for GPU chips from Nvidia and massive growth in electricity consumption at data centers.

For the European scene, this means further regulatory pressure. The EU AI Act seeks to ensure that development proceeds transparently, and regulators also need to address the energy intensity of these technologies in line with the EU's climate goals. For Czech companies, this means that when implementing AI into their processes, they must consider not only API pricing but also the stability and local availability of services that meet European data protection standards (GDPR).

Practical impact: What does this mean for you?

If you're an everyday user or entrepreneur in the Czech Republic, you won't see this problem as "slow internet," but rather as a shift in the character of AI tools. Instead of endless growth in capabilities, we will likely see these trends:

  1. Specialization: Instead of one all-knowing model, companies will use smaller, highly specialized models trained on specific, private data (e.g., legal or medical AI).
  2. High price of quality: Truly intelligent and reliable AI will likely become more expensive, while "general" AI will be available for free but with a higher risk of hallucination.
  3. Return to human creation: Quality, unique texts and human experience will have higher value for AI training than ever before. This may paradoxically increase the value of creative professionals' work.

In the Czech context, it's important to monitor whether tools like ChatGPT or Claude continue to fully support Czech as data becomes scarcer. So far, these models remain somewhat less capable in Czech than in English, which is a direct consequence of the smaller volume of Czech-language training data.

Does this mean AI will stop improving?

No, but the way it improves will change. Development will shift from merely "scanning the internet" to sophisticated learning from curated data, simulations, and the use of synthetic data with high quality control.

How can a Czech company address the data shortage for its own AI?

The best approach is building proprietary, high-quality datasets from internal processes. Leveraging your own documents through RAG (Retrieval-Augmented Generation), where relevant passages are retrieved from your documents and passed to an existing model along with the question, is currently the most effective path to intelligent AI without needing to train a new model from scratch.
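The retrieval half of RAG can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production setup: it scores documents by simple word overlap (cosine similarity over word counts) instead of the learned embeddings real systems typically use, the sample documents in `DOCS` are invented, and the functions `retrieve` and `build_prompt` are hypothetical names. The resulting prompt would then be sent to whichever model the company uses; that call is left out here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return up to k documents most similar to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no words with the query at all.
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(query, documents, k=2):
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Invented internal documents, for illustration only.
DOCS = [
    "Invoices are due within 14 days of delivery.",
    "Support tickets are answered within two business days.",
    "Employees may work remotely up to three days per week.",
]

if __name__ == "__main__":
    print(build_prompt("When are invoices due?", DOCS, k=1))
```

The design point is that the model itself is never retrained: the company's knowledge lives in its documents, and only the passages relevant to each question are handed to the model at query time.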

Will AI text generation harm the internet?

There is a risk of "digital pollution," where the internet becomes flooded with low-quality AI content. Future models could then struggle to distinguish truthful information from generated hallucinations.
