In artificial intelligence, attention has shifted in recent months from simple "chatting" to building autonomous agents. These systems do not just generate answers; they write software that then performs specific tasks. A recent test in which different models had to program bots for assembling polyominoes offers a useful look at how well the individual market leaders handle spatial reasoning and precise programming.
What are polyominoes and why are they a challenge for AI?
For laypeople: polyominoes are shapes made of n unit squares joined edge to edge (a domino is a polyomino of two squares, an L-shaped tromino one of three). Fitting these shapes into a given area requires not only logic but, above all, spatial reasoning. For language models, which are trained mainly on streams of text, the concept of a "shape in space" is naturally harder than constructing a sentence.
If a model can write code that solves this problem, it suggests that the model can apply geometric and logical rules consistently, rather than only predicting a statistically likely next word. This shift is crucial for the future automation of software engineering.
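To make the task concrete, here is a minimal Python sketch of the building blocks such a bot would need. It is an illustration under stated assumptions, not code from the test: shapes are represented as sets of (row, column) cells, and the names `is_polyomino`, `rotations`, and `fits` are hypothetical.

```python
from collections import deque

def is_polyomino(cells):
    """Return True if the cells form one edge-connected shape."""
    cells = set(cells)
    if not cells:
        return False
    start = next(iter(cells))
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        # Only the four edge neighbours count; diagonal contact does not.
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nb in cells and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen == cells

def normalize(cells):
    """Shift a shape so that its smallest row and column are both 0."""
    min_r = min(r for r, _ in cells)
    min_c = min(c for _, c in cells)
    return frozenset((r - min_r, c - min_c) for r, c in cells)

def rotations(cells):
    """Return the distinct 90-degree rotations of a shape (no reflections)."""
    result = set()
    current = list(cells)
    for _ in range(4):
        current = [(c, -r) for r, c in current]
        result.add(normalize(current))
    return result

def fits(board, shape, top, left):
    """Check that the shape, placed at (top, left), stays on the board
    and covers only free cells (free cells are falsy, e.g. 0)."""
    rows, cols = len(board), len(board[0])
    for r, c in shape:
        rr, cc = top + r, left + c
        if not (0 <= rr < rows and 0 <= cc < cols) or board[rr][cc]:
            return False
    return True

DOMINO = {(0, 0), (0, 1)}
L_TROMINO = {(0, 0), (1, 0), (1, 1)}
```

For example, `fits([[0, 0], [0, 0]], L_TROMINO, 0, 0)` is true on an empty 2×2 board, while placing the same piece one row lower pushes it off the board. A full solver would combine `rotations` and `fits` in a backtracking search, which is omitted here.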
Battle of top models: Benchmarks and capabilities
According to current data from Tech Insider and other analyses, we can divide individual models in this type of task according to their specific strengths:
Claude (Anthropic) – King of logic and mathematics
The Claude Opus 4.6 model ranks at the top of tests focused on complex coding and mathematical reasoning. Thanks to its large context window (200 thousand tokens as standard), Claude can analyze large parts of a code repository at once. In tasks like assembling polyominoes, Claude should excel in algorithmic precision. Price: For advanced users, it offers a Pro plan for approximately 20 USD/month.
Gemini (Google) – Multimodal advantage
Gemini 3.1 has a notable advantage in this experiment: native multimodality. Where the task is otherwise presented to a model as a text description, Gemini can directly process visual inputs (images and video), so it can analyze the visual state of the assembled pieces. Price: Standard subscription is around 20 USD/month; integration into Google Workspace is another option.
DeepSeek – Efficiency at a fraction of the price
DeepSeek V4, from the Chinese company DeepSeek, represents a fascinating technological leap. It uses a Mixture-of-Experts (MoE) architecture, where out of a total of 671 billion parameters, only 37 billion are active per token. This makes it a fast and cheap model that often catches up to GPT-4 in coding benchmarks. Price: DeepSeek offers a very generous free tier, and its API is significantly cheaper than the competition.
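The arithmetic behind that efficiency claim can be sketched in a few lines of Python. The parameter counts are the ones quoted above; the `top_k_route` function is a hypothetical toy that only illustrates the general idea of sending each token to a few experts, and it is not DeepSeek's actual router.

```python
import math

TOTAL_PARAMS_B = 671   # total parameters in billions (figure quoted in the text)
ACTIVE_PARAMS_B = 37   # parameters active per token in billions (figure quoted in the text)

def active_fraction(active, total):
    """Share of the model's parameters that take part in processing one token."""
    return active / total

def top_k_route(scores, k):
    """Toy MoE routing: keep the k highest-scoring experts and
    softmax-normalize their scores into mixing weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active share per token: {share:.1%}")

# Hypothetical router scores for four experts; only two are used for this token.
print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
```

With these figures, roughly 5.5% of the parameters are used for any single token, which is why the per-token compute cost is much lower than the total model size suggests.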
GPT (OpenAI) – The gold standard
GPT-5.4 remains an indispensable all-rounder. Its strength lies in the balance between creativity and logic. Although Claude may narrowly win in pure mathematics, GPT offers the most stable environment for agent development thanks to broad support for tools and plugins.
Quick comparison for programming agents:
- Best logic/mathematics: Claude
- Best visual analysis: Gemini
- Best price/performance ratio: DeepSeek
- Best versatility: GPT
Practical impact: What does it mean for you?
This experiment is not just about playing Tetris. It shows the path toward autonomous software development. For an ordinary user, this means that in the near future you will define goals rather than write code. For example: "Create an app that will automatically stack logistics containers in a warehouse according to their volume."
For companies in the Czech Republic and throughout the EU, this represents a huge opportunity and challenge. Automating programming can drastically reduce software development costs, but at the same time raises questions about security and liability for machine-generated code. In the context of the EU AI Act, systems that function as autonomous agents will likely be subject to stricter regulation if they affect critical infrastructure or safety.
Availability in the Czech Republic and language support
The good news for Czech users is that all the above-mentioned models (Claude, Gemini, GPT, DeepSeek) are available in the Czech Republic through web interfaces and APIs. Just as important, all of them handle the Czech language very well. They can not only understand Czech instructions, but also generate code with comments in Czech or explain the logic of a solution in Czech.
Can I use these models for my own software development in Czech?
Yes, all four models handle the Czech language very well. You can give instructions in Czech, and the models will generate code and documentation with Czech comments and explanations.
Is it safe to let AI write code for important systems without supervision?
Currently, no. Although models like Claude or GPT exhibit a high degree of accuracy, they can still generate logical errors or security vulnerabilities. It is recommended to use AI as a "copilot," not as a full-fledged, independent developer.
What is the cheapest way to get started with these models?
The cheapest way is to use the free tier of ChatGPT or Gemini. Developers can also use DeepSeek, whose very cheap API is well suited to testing larger projects at minimal cost.