Qwen3.7: Alibaba Brings a Model for the Era of AI Agents. Autonomous Performance for 35 Hours and 1000 Steps

May 20, 2026 Daniel Cesak

Alibaba this week officially introduced Qwen3.7-Max, its most ambitious proprietary AI model to date. While previous generations competed mainly on chat skills, the new model aims straight at the heart of the agent era: it can autonomously program, debug, and control office workflows, and it can keep working for 35 hours straight with over a thousand tool calls. Qwen3.7-Max is meant to be a universal foundation for AI agents across frameworks, from Claude Code through OpenClaw to Alibaba's own Qwen Code.

Not another chatbot. It's an agent foundation

When Alibaba released Qwen3.6-Plus in April 2026, it promised a model "for real agents." At the time, it was more of a promise than reality. With Qwen3.7-Max, announced on May 18, 2026, the Chinese giant finally backs up that claim with hard data. The model was designed to handle long autonomous tasks — from frontend prototyping through complex software engineering to office process automation using MCP (Model Context Protocol) integrations.

According to the official Qwen blog post, the model excels in three key areas: coding agent (code development and fixes), productivity and workflow automation (office tools, multi-agent orchestration), and long-term autonomous execution (hundreds to thousands of consecutive steps without human intervention).

Benchmarks: Where Qwen3.7-Max shines and where it stumbles

Alibaba published extensive comparisons with competing models (Opus-4.6 Max, K2.6 Thinking, GLM-5.1 Thinking, DS-V4-Pro Max) and with its own predecessor, Qwen3.6-Plus. The results put the Chinese model in the absolute top tier, and ahead of the field in several disciplines.

Coding agent: Where agent AI starts to make sense

In programming skill tests, Qwen3.7-Max scored 69.7 on Terminal Bench 2.0-Terminus, beating DS-V4-Pro Max (67.9). On SWE-Bench Verified it scored 80.4, roughly level with Opus-4.6 Max (80.8) and DS-V4-Pro Max (80.6), though slightly behind both. It moves ahead on SWE-Pro (60.6 versus 59.0 for DS-V4-Pro Max) and on SWE-Multilingual (78.3, the best of all tested models). It also performed strongly on SciCode (53.5) and NL2Repo (47.2), where it handily beat the competition.

General agents: From the office to the OS kernel

It is in general agent capabilities that the progress over Qwen3.6-Plus is most striking. On MCP-Mark (which tests work with MCP tools such as GitHub MCP and Playwright), it reached 60.8, an improvement of over 12 points. On SkillsBench (tested via the OpenCode scaffold on 78 tasks), it scored 59.2, which is 13.5 points more than Qwen3.6-Plus. On MCP-Atlas it reached 76.4, edging past Opus-4.6 (75.8).

Kernel Bench L3, a test measuring the AI's ability to optimize GPU kernels, deserves special attention. Qwen3.7-Max achieved a 1.98× median speedup over the reference PyTorch implementation and a 96% success rate (meaning it generated faster code than torch.compile in 96% of cases). For comparison, DS-V4-Pro Max achieved only a 1.07× speedup and a 54% success rate.
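To make the two Kernel Bench figures concrete, here is a minimal Python sketch of how a median speedup and a success rate could be computed from per-task timings. The function name, the timing lists, and the numbers are made up for illustration. The benchmark's actual measurement procedure is not described in the announcement, so treat this as one plausible reading of the definitions above.

```python
from statistics import median


def kernel_bench_summary(reference_ms, generated_ms, baseline_ms):
    """Return (median speedup, success rate) for a set of kernel tasks.

    reference_ms: runtimes of the reference PyTorch implementation
    generated_ms: runtimes of the model-generated kernels
    baseline_ms:  runtimes of the baseline to beat (assumed: torch.compile)
    """
    # Per-task speedup of the generated kernel over the reference.
    speedups = [ref / gen for ref, gen in zip(reference_ms, generated_ms)]
    # A task counts as a success when the generated kernel beats the baseline.
    wins = sum(gen < base for gen, base in zip(generated_ms, baseline_ms))
    return median(speedups), wins / len(generated_ms)


# Illustrative timings in milliseconds (not real benchmark data).
reference_ms = [10.0, 8.0, 12.0, 6.0]
generated_ms = [5.0, 4.0, 8.0, 6.5]
baseline_ms = [7.0, 5.0, 9.0, 6.0]

median_speedup, success_rate = kernel_bench_summary(
    reference_ms, generated_ms, baseline_ms
)
print(f"median speedup: {median_speedup:.2f}x, success rate: {success_rate:.0%}")
# prints: median speedup: 1.75x, success rate: 75%
```

Note that the two numbers answer different questions: the median describes the typical gain over the reference, while the success rate counts how often the generated kernel beats the stronger compiled baseline.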

STEM and reasoning: Strength in the field of science

In scientific reasoning tests, Qwen3.7-Max scored 92.4% on GPQA Diamond (best of the tested group), 41.4% on HLE (second behind K2.6, which reached 54.0% on HLE with tools), 91.6% on LiveCodeBench, and 97.1% on HMMT 2026 Feb. On IMOAnswerBench it scored 90.0, just ahead of DS-V4-Pro Max (89.8).

35 hours of autonomous work: A proof of concept with weight

The most impressive illustration of Qwen3.7-Max's capabilities is a demonstration of 35-hour autonomous kernel optimization, during which the model executed over 1,000 tool calls without any human intervention. Each test ran in an isolated Docker container with a single H100 80 GB GPU, with internet access limited to CUTLASS documentation and official CUDA documentation only.

This is a fundamental shift from the short, single-task interactions we are used to from chat interfaces. The model demonstrated the ability to maintain a consistent strategy and learn from its own mistakes over hundreds of iterations — a capability crucial for real-world deployment of AI agents.
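The shape of such a long-running session can be sketched as a tool-calling loop with a step budget and a wall-clock budget. The Python below is a toy illustration and not Alibaba's harness: `run_agent`, `choose_action`, and `benchmark` are hypothetical names, the "policy" is a hard-coded stub standing in for the model, and the fake benchmark tool replaces real kernel compilation and timing.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RunLog:
    steps: int = 0
    history: list = field(default_factory=list)


def run_agent(choose_action, tools, max_steps=1000, max_seconds=35 * 3600):
    """Call tools in a loop until the policy stops or a budget runs out."""
    log = RunLog()
    deadline = time.monotonic() + max_seconds
    while log.steps < max_steps and time.monotonic() < deadline:
        action = choose_action(log.history)  # (tool_name, argument) or None
        if action is None:
            break
        name, arg = action
        try:
            result = tools[name](arg)
        except Exception as exc:
            # Feed the failure back into the history so later steps can react.
            result = f"error: {exc}"
        log.history.append((name, arg, result))
        log.steps += 1
    return log


def benchmark(tile):
    # Stand-in for compiling and timing a kernel; rejects invalid tile sizes.
    if tile <= 0:
        raise ValueError("tile size must be positive")
    return abs(tile - 64) + 10  # fake runtime in ms, best at tile = 64


def choose_action(history):
    # Stub policy: try a fixed list of tile sizes, then stop.
    candidates = [0, 32, 64, 128]
    if len(history) < len(candidates):
        return ("benchmark", candidates[len(history)])
    return None


log = run_agent(choose_action, {"benchmark": benchmark})
best = min((r for _, _, r in log.history if isinstance(r, int)), default=None)
print(log.steps, best)  # prints: 4 10
```

The first call fails on purpose, and the loop records the error instead of aborting. Surviving individual failures is what allows a run to continue for hundreds of iterations.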

One model, many frameworks

An interesting aspect of Qwen3.7-Max is its cross-scaffold generalization — the ability to consistently function across different agent frameworks. Alibaba tested the model in Claude Code (from Anthropic), OpenClaw, Qwen Code, and others — and performance held at comparable levels regardless of the scaffold used.

This is important news for developers: it means they don't have to be tied to a single ecosystem. The model handles environments they already know and use.

What it means for Czechia and Europe

Qwen3.7-Max will be available through Alibaba Cloud Model Studio, which has been operating in the European Frankfurt region since March 2026. This means European companies, including Czech ones, can use the model with guaranteed data residency in the EU, which is crucial for GDPR compliance and for the EU AI Act, whose obligations are being phased in.

For Czech developers and companies experimenting with AI agents (for example, for automating customer support, internal processes, or software development), Qwen3.7-Max represents an interesting alternative to models from OpenAI, Anthropic, or Google, especially if they are looking for a capable agent with lower operational costs. Czech language support in Qwen-series models has traditionally been at a high level. Qwen3.6 achieved a score of 84.3 in the multilingual WMT24++ translation test, and Qwen3.7-Max is expected to show further improvement in multilingual benchmarks (WMT24++ 85.8, MAXIFE 89.2).

Moreover, Qwen3.7-Max enters the market at a time when Czech companies like Ecomail are already connecting AI agents with their own services and the Czech National Bank is building its own AI center — demand for powerful and reliable agent models in Czechia is real.

Pricing and availability

Alibaba has not yet published specific pricing for Qwen3.7-Max — the model is only now being made available through Alibaba Cloud Model Studio. For reference, Qwen3.6-Plus pricing runs around 0.5–2 USD per million input tokens and 3–6 USD per million output tokens. Qwen3.7-Max can be expected to be in the premium category, but still significantly cheaper than comparable Western models.
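As a rough budgeting aid, the Qwen3.6-Plus reference range above can be turned into a cost estimate. The Python sketch below uses a hypothetical workload of 20 million input tokens and 2 million output tokens. Both the function name and the workload are assumptions for illustration, and the prices are the approximate Qwen3.6-Plus figures quoted here, not Qwen3.7-Max pricing.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Estimate API cost in USD from token counts and per-million-token prices."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)


# Hypothetical long agent run: 20M input tokens, 2M output tokens.
input_tokens = 20_000_000
output_tokens = 2_000_000

# Low and high ends of the quoted Qwen3.6-Plus range (USD per million tokens).
low = estimate_cost_usd(input_tokens, output_tokens, 0.5, 3.0)
high = estimate_cost_usd(input_tokens, output_tokens, 2.0, 6.0)
print(f"{low:.2f} to {high:.2f} USD")  # prints: 16.00 to 52.00 USD
```

Agent workloads tend to be input-heavy because the growing tool-call history is resent on every step, so the input price usually dominates the bill.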

For individual developers, Model Studio offers 1 million free tokens per model, allowing risk-free testing.

What is the difference between Qwen3.7-Max and Qwen3.6-Plus?

Qwen3.6-Plus was primarily focused on multimodal capabilities (text, image, audio) with a million-token context window and agent coding. Qwen3.7-Max goes a step further — shifting the emphasis to long-term autonomous execution (hundreds to thousands of steps), significantly better performance in agent benchmarks (MCP-Mark, SkillsBench), and the ability to function across different agent frameworks. It represents a shift from "a model with agent capabilities" to "a model built for the agent era."

Will Qwen3.7 be available as open-source, or only via API?

Qwen3.7-Max is a proprietary model and will be available only via API on Alibaba Cloud Model Studio. With previous generations, Alibaba released smaller variants (e.g. Qwen3.6-35B-A3B) under the open-source Apache 2.0 license. It has not yet been announced whether a similar open variant will appear for Qwen3.7. For evaluation and testing, the free 1 million token credit on Model Studio can be used.

Can Qwen3.7-Max replace developers in programming?

Not entirely. The model achieves excellent results in agent coding: on SWE-Bench Verified it scored 80.4, meaning it resolved about 80% of the benchmark's tasks, which are drawn from real GitHub issues. That is not the same as solving 80% of arbitrary real-world issues. The model still requires human oversight, especially for complex architectural decisions, safety-critical systems, and tasks requiring understanding of broader business context. It is more of an extremely capable assistant than a full replacement.