AI agents in code: Why do benchmark promises clash with the reality of real-world development?

June 14, 2026 jarvis

  Autonomous AI agents, such as Claude Code or Devin, are perceived as the next big step in software engineering automation. While they achieve astonishing results in controlled tests, the reality of complex projects is much harsher for them. Research teams from MIT and other institutions warn: an agent's ability to "know" does not mean it can effectively "do" in the real world.

In recent months, we have witnessed a massive shift from simple code completion (autocomplete) to the era of autonomous agents. These systems no longer just suggest text; they can read files, run tests, and fix bugs. However, as current research shows, there is a gap between what agents can do in the lab and what they can achieve in a real repository.

The Benchmark Trap: Why "High Scores" Can Lie

One of the main problems with current AI evaluation is the way so-called skills are tested. These skills are essentially specialized instructions or code fragments that help an agent solve specific tasks, such as working with a particular API.

According to a study by researchers from UC Santa Barbara and MIT CSAIL, as reported by The Decoder, these tests often paint an overly flattering picture. In the SKILLSBENCH testing environment, agents are handed the relevant skills directly. This is like giving a student a question with an open textbook containing the exact solution right next to them.

However, as soon as the situation becomes realistic – that is, when the agent has to search through a huge library of skills (34,000 items in the study) – performance drops drastically. For the Claude Opus 4.6 model, the success rate dropped from 55.4% with direct instruction delivery to 38.4% when it had to search on its own. For weaker models, it's even worse: for models like Kimi K2.5 or Qwen3.5, the presence of irrelevant skills in the library can even reduce their overall ability to solve a task.
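
To put that drop in perspective, the two success rates quoted above can be compared in both absolute and relative terms. The short Python sketch below is only illustrative arithmetic on the figures from the study; the function name `drop_summary` is made up for this example.

```python
def drop_summary(direct: float, searched: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = direct - searched
    relative = absolute / direct * 100
    return round(absolute, 1), round(relative, 1)


# Success rates quoted in the article for Claude Opus 4.6.
absolute_drop, relative_drop = drop_summary(direct=55.4, searched=38.4)
print(f"Absolute drop: {absolute_drop} percentage points")
print(f"Relative drop: {relative_drop} % of the original success rate")
```

In other words, a loss of 17 percentage points means the agent gives up roughly three out of every ten tasks it solved when the skill was handed to it directly.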

The "Right Neighborhood" Problem: I Find the File, But Not the Line

Even if you give the agent the correct context, it encounters another barrier: precision. The new SWE-Explore benchmark from researchers at Shanghai Jiao Tong University shows that AI agents suffer from the "right neighborhood" syndrome. The agent can identify which file contains the error, but cannot precisely locate the lines that need to be changed for a functional fix.

Traditional metrics have so far only asked: "Did it fix what it was supposed to?" If yes, the agent won. SWE-Explore, however, goes deeper and examines the code exploration process itself. It shows that agents often "wander" within the correct file but miss key logical blocks. This means that even if the result looks functional, the path to it was inefficient and prone to errors that could cause regression bugs in the future.
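
The difference between "found the right file" and "found the right lines" can be made concrete with a small sketch. The Python below is not the actual SWE-Explore scoring; it is a hypothetical illustration in which `Location`, `file_hit`, and `line_recall` are invented names, and the file name and line numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """A single line in a single file (hypothetical representation)."""
    path: str
    line: int


def file_hit(explored: set[Location], required: set[Location]) -> bool:
    """True if the agent looked at any file that contains a required line."""
    explored_files = {loc.path for loc in explored}
    required_files = {loc.path for loc in required}
    return bool(explored_files & required_files)


def line_recall(explored: set[Location], required: set[Location]) -> float:
    """Fraction of the required lines that the agent actually looked at."""
    if not required:
        return 1.0
    return len(explored & required) / len(required)


# Invented example: the fix needs four lines, the agent read lines 10-45 only.
required = {Location("billing.py", n) for n in (40, 41, 42, 88)}
explored = {Location("billing.py", n) for n in range(10, 46)}

print(file_hit(explored, required))     # the right file was opened
print(line_recall(explored, required))  # but one required line was never seen
```

A file-level check reports success here, while the line-level view shows that a quarter of the required lines were never inspected. That is the "right neighborhood" pattern in miniature.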

Clash of the Titans: Claude Code vs. Devin vs. Cursor

In the context of 2026, three main players have established themselves in the market, each approaching autonomy differently. For a Czech developer or company, it is important to know what they are paying for:

Claude Code (Anthropic): Currently considered a leader in understanding complex architecture and refactoring. It works directly in the terminal and has an excellent "extended thinking" mode.
Price: Requires a Claude Pro or Team subscription (approx. 20–30 USD/month).
Devin (Cognition): The first full-fledged autonomous agent that strives for complete independence from human input. It is capable of working on tasks for tens of minutes without intervention.
Price: Primarily focused on the enterprise segment, price on request (significantly higher than standard tools).
Cursor: An IDE built directly around AI, combining editor convenience with agents. It is the best choice for quick integration into daily work.
Price: Free tier available; Pro version approx. 20 USD/month.

Practical Impact: What Does This Mean for Czech Companies and Developers?

For the Czech IT scene, which is strongly oriented towards service export and high-end software, this research brings an important warning. You cannot leave AI agents completely unsupervised.

1. Human-in-the-loop is a necessity: Due to precision problems (missed lines), every commit created by an agent must undergo a thorough code review by a human developer. AI agents are great for a "first draft," but not for final approval.

2. Availability and language: Although these tools primarily work with English code, their ability to understand assignments in Czech is constantly improving thanks to models like Claude or GPT-5. For Czech companies, this means they can use agents for documentation or code commenting even in a local context, but the code itself should remain strictly in English.

3. EU Regulation (AI Act): Transparency will be key within the European market. If a company uses autonomous agents to generate software, it must be able to demonstrate how the security and integrity of the code were ensured, which, given the current shortcomings in agent precision, poses a challenge for compliance.

Can I use AI agents for critical infrastructure or banking systems?

Currently, not without extremely strict human supervision. Research shows that agents can omit key lines of code during fixes, which poses an unacceptable risk in critical systems.

Do I need perfect English to use these tools?

For the code itself, yes, because English syntax and terminology are standard. However, for task input (prompting), modern models handle Czech very well, allowing Czech developers to work more efficiently.

How do I know if the agent truly understood my project and isn't just "guessing"?

Observe the exploration process. Quality tools like Cursor or Claude Code allow you to see which files the agent has read. If the agent's log does not show active reading of relevant files, it is likely just generating code based on patterns, not on your actual architecture.
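
One way to make this check less subjective is to compare the files the agent reports having read with the files you consider relevant to the task. The Python sketch below assumes you can copy those paths out of the agent's log as plain strings; it does not rely on any real export format of Cursor or Claude Code, and the function name `unread_relevant_files` and the example paths are hypothetical.

```python
import posixpath


def unread_relevant_files(read_paths: list[str], relevant_paths: list[str]) -> list[str]:
    """Return the relevant files that never appear among the files the agent read."""
    # Normalize so that "./src/auth.py" and "src/auth.py" count as the same file.
    read = {posixpath.normpath(p) for p in read_paths}
    relevant = {posixpath.normpath(p) for p in relevant_paths}
    return sorted(relevant - read)


# Invented example: paths copied by hand from an agent session log.
agent_read = ["./src/auth.py", "README.md"]
should_have_read = ["src/auth.py", "src/session.py"]

missing = unread_relevant_files(agent_read, should_have_read)
print(missing)
```

If the returned list is not empty, the agent proposed changes without ever opening those files, which is a good reason to review its output more carefully.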
