Why robots need to plan like humans
Imagine a robot that has to prepare coffee, tidy a table, and then open a window. For a human, this is trivial. For a robot, it means mastering a long-horizon task: a sequence of steps in which each one depends on the previous, all while the robot continuously perceives the space around it. Current Vision-Language-Action (VLA) policies often hide planning inside the neural network, which leads to errors on more complex operations: either the robot loses track of what to do first, or it misjudges where exactly to place an object.
Textual chain-of-thought reasoning helps maintain a logical sequence of actions but lacks spatial imagination. Conversely, purely visual predictions offer geometric guidance, yet without semantic context they can be ambiguous. This complementarity inspired a team of researchers to develop a new approach.
IVLR-Trace: Storyboard for robots
Researchers Jinkun Liu, Haohan Chi, and their colleagues from Tsinghua University and Xiaomi Group proposed a method called Interleaved Vision-Language Reasoning (IVLR). Its core is the IVLR-Trace, an explicit intermediate representation that alternates textual subgoals with visual keyframes over the course of the task. It is essentially a storyboard: the robot first generates a plan of the form "pick up the white cup – [image] – place the cup on the tray – [image] – pick up the yellow cup…" and then uses this plan as context during actual execution.
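The authors have not published code, but a minimal Python sketch can illustrate what such an interleaved trace might look like as a data structure (the names TextSubgoal, Keyframe, and Trace are hypothetical, chosen for this example):

```python
from dataclasses import dataclass
from typing import Union

import numpy as np

@dataclass
class TextSubgoal:
    """A language step of the plan, e.g. 'pick up the white cup'."""
    text: str

@dataclass
class Keyframe:
    """A visual step: an image of how the scene should look after the subgoal."""
    image: np.ndarray  # H x W x 3 RGB array

# An interleaved trace alternates text subgoals with visual keyframes.
Trace = list[Union[TextSubgoal, Keyframe]]

placeholder = np.zeros((224, 224, 3), dtype=np.uint8)  # stands in for a generated keyframe
trace: Trace = [
    TextSubgoal("pick up the white cup"),
    Keyframe(placeholder),
    TextSubgoal("place the cup on the tray"),
    Keyframe(placeholder),
    TextSubgoal("pick up the yellow cup"),
]
```

The point is that the plan is an ordinary, inspectable sequence: anyone can print it and see what the robot intends to do, which is exactly the explainability advantage discussed later.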
The key is that the policy remains closed-loop. The robot does not blindly copy the generated images; at each step, it takes into account the current observation of the environment along with the original task and the stored trace. This allows it to react to minor deviations while maintaining a global overview of the entire task.
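In pseudocode, this closed-loop behavior might look roughly as follows; the env interface and the generate_trace and act methods are hypothetical stand-ins, since the actual API has not been published:

```python
def run_episode(env, policy, task: str, max_steps: int = 500):
    """Sketch of closed-loop execution: the trace is generated once,
    but every single action is re-conditioned on the live observation."""
    obs = env.reset()
    trace = policy.generate_trace(task, obs)  # the "storyboard" plan
    for _ in range(max_steps):
        # The policy sees the current observation, the original task,
        # and the stored trace, so small deviations can be corrected
        # without losing the global plan.
        action = policy.act(obs, task, trace)
        obs, done = env.step(action)
        if done:
            break
    return obs
```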
How it learns when data is scarce
Standard robot demonstration datasets do not contain such interleaved text-image traces. The authors therefore built a pseudo-supervision pipeline: they split each trajectory into phases, selected a keyframe from each phase, and described it with a vision-language model. The process is not perfect, but it allows training the policy on standard datasets without manual annotation. Training ran on 16 NVIDIA H200 GPUs: about 4 hours (40,000 steps) on the LIBERO benchmark and roughly 6 hours (60,000 steps) on SimplerEnv.
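As a rough sketch of that pipeline, assuming an even split into phases and a caption_fn wrapping some vision-language model (the paper's exact phase-splitting heuristic may differ):

```python
def build_trace_labels(frames, n_phases, caption_fn):
    """Pseudo-supervision sketch: split a demonstration into phases,
    take one keyframe per phase, and caption it with a VLM."""
    phase_len = max(1, len(frames) // n_phases)
    trace = []
    for i in range(n_phases):
        # Use the last frame of each phase as its keyframe.
        end = min((i + 1) * phase_len, len(frames)) - 1
        keyframe = frames[end]
        subgoal = caption_fn(keyframe)  # VLM-generated text description
        trace.append((subgoal, keyframe))
    return trace
```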
Results that beat the competition
The method was tested on two established simulation benchmark suites. On LIBERO, which contains four task sets testing spatial generalization and long-horizon planning, IVLR achieved a 95.5% average success rate. On the hardest set, LIBERO-Long, it reached 92.4%. For comparison, the CoT-VLA method achieved 69.0% and VLA-0 87.6%. On SimplerEnv-WidowX, which simulates visual distribution shifts, IVLR recorded 59.4% overall success, while SpatialVLA managed only 42.7%.
Most interesting, however, are the ablations: experiments in which the authors removed individual components one by one. Without the trace, success on LIBERO-Long dropped to 37.7%. A text-only trace raised the score to 62.0%, a purely visual one to 68.4%. Only their combination reached the full 92.4%. This is strong evidence that text and image complement each other and that both are essential for long tasks.
Resilience to errors
The authors also ran stress tests. With random perturbations of the end effector by up to 2 cm, and with 30% of the trace content masked, performance dropped only slightly and remained well above that of methods without a trace. This suggests the representation is robust to local errors and minor deviations during execution. Limitations appear mainly with outdated or incorrect global plans: if the robot misjudges the overall scenario at the start, the trace no longer helps.
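For illustration, the two stress tests could be approximated like this (a sketch only; the paper's exact noise and masking procedures may differ):

```python
import random

import numpy as np

def perturb_effector(position_m: np.ndarray, radius_m: float = 0.02) -> np.ndarray:
    """Offset the end-effector position by a random vector of up to 2 cm."""
    direction = np.random.randn(3)
    direction /= np.linalg.norm(direction)
    return position_m + direction * np.random.uniform(0.0, radius_m)

def mask_trace(trace: list, mask_ratio: float = 0.3, mask_token: str = "[MASK]") -> list:
    """Randomly replace a fraction of trace elements, mimicking the 30% masking test."""
    return [mask_token if random.random() < mask_ratio else item for item in trace]
```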
What this means for the Czech Republic and Europe
This research comes from China, where the collaboration between a leading academic institution (Tsinghua) and a tech giant (Xiaomi) illustrates how quickly robotics research is moving into commercial applications. Xiaomi is also active in the Czech market, although its humanoid and manipulation robots are not yet commonly available in the Czech Republic. For the European scene, it is important that similar technologies will fall under the EU AI Act, which emphasizes transparency and safety of autonomous systems. IVLR, with its explicit trace, naturally offers explainability: the robot's plan is readable, auditable, and debuggable, which can ease certification in the European environment.
While Europe is developing initiatives such as Robotics4EU and projects under Horizon Europe, the Chinese approach shows that the key to practical robots is not just more data, but a better representation for reasoning. Czech companies and research teams could draw inspiration from exactly this direction of connecting language models with visual prediction, especially since the Czech Republic hosts a number of top research groups in computer vision and AI.
Conclusion
The IVLR method represents a significant step forward in long-horizon robotic manipulation. It shows that combining text and images into a single explicit trace lets robots not only plan better but also withstand common operational errors. With over 92% success on demanding tasks, it opens a path toward robots that can genuinely handle complex domestic and industrial operations, and explain what they are doing along the way.
Is the IVLR method available as open source?
The authors have not yet released the complete code or pretrained models. Since the paper appeared on May 1, 2026, an implementation can be expected to follow in the coming weeks or months, as is common in the field.
What is the difference between IVLR and common language models like GPT?
While GPT primarily works with text, IVLR is a specialized robotic policy that combines textual planning with visual keyframes and directly generates motor actions. It is therefore a Vision-Language-Action model, not a pure conversational system.
When might we see similar robots in Czech households?
The results come from simulation, not the real world. Transfer to practice usually takes several years. Given Xiaomi's involvement, however, we can expect commercial applications to appear sooner than in purely academic projects.
Sources: Liu et al., arXiv:2605.00438 (2026); LIBERO Benchmark, arXiv:2306.03310