Oppo Unveils X-OmniClaw: Android AI Agent That Sees, Hears, and Controls Apps Directly on the Phone

May 17, 2026 Daniel Cesak

Oppo is entering the field of mobile AI agents with an open-source release. X-OmniClaw, an agent from the company's Multi-X research team, runs directly on Android devices, connects the camera, screen, and voice into a single system, and can control applications autonomously: comparing prices, creating photo albums, or learning shortcuts into deeply nested menus. It does all this on the phone itself rather than on a copy of the phone hosted in the cloud, although some model calls still go to cloud services.

Agent in your pocket: X-OmniClaw runs directly on your phone

Oppo's Multi-X research team has published a technical report and open-source code for the X-OmniClaw agent, a system that combines camera, screen, and voice control into a single autonomous assistant for Android. Unlike competing solutions that run agents in virtualized copies of phones in data centers (such as RedFinger, Alibaba Wuying, or Tencent Cloud Phone), X-OmniClaw works directly on the physical device. That gives it direct access to local sensors, the camera, and personal data, without mirroring the phone to a remote server.

The technical report was published on arXiv on May 7, 2026, and all code is available on GitHub under the Apache 2.0 license. The project builds on the HermesApp codebase and is inspired by the OpenClaw concept — a well-known open-source agent primarily focused on computers.

Three pillars: perception, memory, and action

The X-OmniClaw architecture rests on three main components that the researchers call Omni Perception, Omni Memory, and Omni Action.

Omni Perception: a single pipeline for all inputs

The agent combines three input channels into one. A vision-language model (VLM) first interprets the camera scene together with the user's request, for example "How much does this cost on Taobao?", and only then passes a structured intent on for execution. A key role is played by the temporal alignment module, which synchronizes screenshots, audio input, and the live camera feed into a coherent context. Without it, the agent wouldn't know that a question asked by voice relates to the product you're currently pointing the camera at.
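
The article does not describe how the alignment module works internally. As a minimal sketch of the general idea, assuming every camera frame and screenshot carries a capture timestamp, one could bundle a spoken request with the frames closest to the middle of the utterance. The names `Sample`, `nearest`, and `align` are hypothetical and not taken from the X-OmniClaw code.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One timestamped input, e.g. a camera frame or a screenshot (hypothetical type)."""
    t: float   # capture time in seconds
    ref: str   # identifier of the frame, e.g. a file name


def nearest(samples: list[Sample], t: float) -> Sample:
    """Return the sample whose timestamp is closest to t (samples must be sorted by t)."""
    if not samples:
        raise ValueError("no samples to align against")
    times = [s.t for s in samples]
    i = bisect_left(times, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = samples[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda s: abs(s.t - t))


def align(utterance_start: float, utterance_end: float,
          camera: list[Sample], screen: list[Sample]) -> dict:
    """Bundle the frames nearest to the midpoint of a spoken request."""
    mid = (utterance_start + utterance_end) / 2
    return {
        "time": mid,
        "camera": nearest(camera, mid).ref,
        "screen": nearest(screen, mid).ref,
    }
```

A real module would likely also have to handle clock drift between sensors and utterances that span several scenes; this sketch ignores both.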

Omni Memory: gallery as a searchable database

An interesting feature is long-term memory built on local data. During idle periods, the agent turns photos from the gallery into compact text descriptions of objects, scenes, and events, and stores them in a Markdown file, image-memory.md. Each entry passes through a filter that removes sensitive information. The researchers admit in the technical report that the current solution still sends images to a cloud visual model for processing, but they plan to move to fully local models as the next step, so that raw images never leave the phone.
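
To make the idea concrete, here is a small sketch of a Markdown-backed photo memory with a redaction step and keyword search. The entry layout, the two redaction patterns, and the function names are assumptions made for illustration; the article does not specify the actual format of image-memory.md or the rules of the sensitivity filter.

```python
import re

# Illustrative patterns only; they are crude and will also catch
# other long digit runs, such as dates written inside a description.
SENSITIVE = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),   # e-mail addresses
    re.compile(r"\+?\d[\d\s-]{7,}\d"),          # phone-like digit runs
]


def redact(text: str) -> str:
    """Replace anything matching a sensitive pattern with a placeholder."""
    for pattern in SENSITIVE:
        text = pattern.sub("[redacted]", text)
    return text


def to_markdown_entry(photo: str, taken: str, description: str) -> str:
    """Format one photo description as a Markdown section (hypothetical layout)."""
    return f"## {photo}\n- taken: {taken}\n- description: {redact(description)}\n"


def search(memory_md: str, keyword: str) -> list[str]:
    """Return the photo names whose entry mentions keyword (case-insensitive)."""
    hits = []
    for block in memory_md.split("## ")[1:]:
        name, _, body = block.partition("\n")
        if keyword.lower() in body.lower():
            hits.append(name.strip())
    return hits
```

Plain substring search is the simplest thing that works on a text file; an agent could equally hand the whole file to a language model and let it pick the matching entries.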

Omni Action: learning instead of repeating

Instead of planning every action from scratch, the agent clones user behavior into reusable skills. When you manually click through a path to a deeply nested page (for example, a discount offer in the Meituan app), X-OmniClaw extracts a command that launches that page, and next time it jumps there via a deeplink instead of repeating the taps. To detect clickable elements, it combines the app's XML structure with a visual grounding model and OCR, which is especially helpful in interfaces full of ads, where XML alone fails.
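
One plausible way to combine the two sources of clickable elements is to trust the XML nodes and add only those visually detected boxes that do not overlap any of them. The sketch below does this with an intersection-over-union test. The `Box` type, the 0.5 threshold, and the merge rule are assumptions for illustration, not the method documented by the Oppo team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Screen rectangle of a candidate element (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int
    label: str
    source: str   # "xml" or "vision"

    def area(self) -> int:
        return max(0, self.right - self.left) * max(0, self.bottom - self.top)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two rectangles, 0.0 when they do not overlap."""
    inter = Box(max(a.left, b.left), max(a.top, b.top),
                min(a.right, b.right), min(a.bottom, b.bottom), "", "").area()
    union = a.area() + b.area() - inter
    return inter / union if union else 0.0


def merge_candidates(xml_nodes: list[Box], vision_boxes: list[Box],
                     threshold: float = 0.5) -> list[Box]:
    """Keep all XML nodes and add vision boxes that no XML node already covers."""
    merged = list(xml_nodes)
    for v in vision_boxes:
        if all(iou(v, x) < threshold for x in xml_nodes):
            merged.append(v)
    return merged
```

In an ad-heavy screen, a banner button rendered inside a web view may be missing from the XML tree entirely; under this rule it would survive as a vision-only candidate.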

What X-OmniClaw can do in practice

The researchers demonstrated four real-world scenarios:

Price at a glance: Point the camera at a product, ask for the price — the agent opens a shopping app, goes through results, takes screenshots, and reads prices and sales numbers.
Digital tutor (ScreenAvatar): The agent acts as a floating assistant that works through on-screen exercises one by one on its own, including selecting the correct answers.
Photo album on command: A voice command "make a video from photos of parrots" is enough for the agent to search the gallery memory, find matching images, and move them to CapCut for automatic video creation.
One-word shortcuts: Record a path to a deep menu once, and next time a voice command is enough. The agent jumps there via a deeplink, even in apps that don't publish any.
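
The shortcut scenario boils down to a small lookup table: a recorded path is stored under a name, and a later command either opens the extracted deeplink or falls back to replaying the taps. The sketch below illustrates that structure. `Skill`, `SkillBook`, the action strings, and the example URI are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Skill:
    name: str
    taps: list[str]                  # recorded UI path, kept as a fallback
    deeplink: Optional[str] = None   # direct entry point, if one was extracted


@dataclass
class SkillBook:
    skills: dict[str, Skill] = field(default_factory=dict)

    def record(self, name: str, taps: list[str], deeplink: Optional[str] = None) -> None:
        """Store a demonstrated path under a case-insensitive name."""
        self.skills[name.lower()] = Skill(name, list(taps), deeplink)

    def plan(self, command: str) -> list[str]:
        """Return the actions for a command: one jump if possible, else the taps."""
        skill = self.skills.get(command.strip().lower())
        if skill is None:
            return []
        if skill.deeplink:
            return [f"open {skill.deeplink}"]
        return [f"tap {step}" for step in skill.taps]
```

For example, after `book.record("Coupons", ["Home", "Me", "Wallet", "Coupons"], "exampleapp://coupons")`, the call `book.plan("coupons")` yields a single `open` action instead of four taps.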

How it stacks up against the competition

X-OmniClaw enters a space that is rapidly filling up. OpenClaw, developed by Peter Steinberger, focuses on computers and servers — its main domain is developer workflow automation. Hermes Agent from Nous Research relies on emergent model capabilities. X-OmniClaw differentiates itself with an emphasis on edge-native architecture — it runs on the physical phone and only calls the cloud when necessary.

Google recently showed with the Gemma 4 model that a fully local model on a smartphone can already handle agent tasks (Wikipedia queries, QR code generation, opening mood trackers). X-OmniClaw, by contrast, is designed as an open platform: it can connect to various LLMs via API (OpenRouter, Anthropic Claude, OpenAI GPT, Moonshot Kimi, MiniMax) and, through Ollama, to local models.

Methodologically, X-OmniClaw builds on ByteDance UI-TARS, a purely visual GUI agent that works only with screenshots and coordinates. Oppo adds structural XML data and on-device execution to this approach, reducing the error rate of purely visual pipelines on dynamic interfaces.

Technical details for developers

The code is 95% Kotlin, with smaller portions in Python and Java. The app requires Android 8.0 or higher, and an APK is available for download under GitHub Releases. As of April 2026, it supports:

Multiple parallel sessions with isolated execution: each agent runs in its own thread and can be stopped precisely
Scheduled automation at intervals, on weekdays, or in weekly plans
Voice-visual loop — recording, screenshots, decision, and execution in a single flow
10 pre-built skills for gallery, search, model management, and more
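
The scheduling options in the list above can be illustrated with two small functions that compute the next run time, one for fixed intervals and one for a time of day on selected weekdays. This is a sketch of the general technique using Python's standard library; the function names are hypothetical and the app's own scheduler may work differently.

```python
from datetime import datetime, timedelta


def next_interval_run(last_run: datetime, every: timedelta) -> datetime:
    """Next run of a task that repeats at a fixed interval."""
    return last_run + every


def next_weekday_run(now: datetime, hour: int, minute: int,
                     weekdays: set[int]) -> datetime:
    """Next hour:minute strictly after now on one of weekdays (Mon=0 ... Sun=6)."""
    if not weekdays or not weekdays <= set(range(7)):
        raise ValueError("weekdays must be a non-empty subset of 0..6")
    # Offsets 0..7 cover today plus a full week, in case today's slot has passed.
    for offset in range(8):
        day = now + timedelta(days=offset)
        candidate = day.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if candidate > now and candidate.weekday() in weekdays:
            return candidate
    raise RuntimeError("no slot found")  # not reachable for valid weekdays
```

A weekly plan is then just a collection of such weekday rules, with the scheduler always waking up for the earliest of the computed times.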

To run the agent, you need to provide an API key for one of the supported LLM providers. Settings are edited directly in the app and stored locally in a file on the SD card. For speech transcription, you can use, for example, SiliconFlow SenseVoice Small; for visual understanding, a VLM via OpenRouter.
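
The article does not show what the configuration file looks like. As a hedged sketch of the kind of checks such a setup needs, the following validates a provider choice and masks the key before it is shown or logged. The field names and the `AgentConfig` class are hypothetical, the provider names are taken from the list in this article, and the rule that a local Ollama model needs no key is an assumption.

```python
from dataclasses import dataclass

# Provider names as mentioned in this article; the grouping is an assumption.
CLOUD = {"openrouter", "anthropic", "openai", "moonshot", "minimax"}
LOCAL = {"ollama"}


@dataclass(frozen=True)
class AgentConfig:
    provider: str
    model: str
    api_key: str = ""

    def problems(self) -> list[str]:
        """Return human-readable issues; an empty list means the config is usable."""
        issues = []
        name = self.provider.lower()
        if name not in CLOUD | LOCAL:
            issues.append(f"unknown provider: {self.provider}")
        if name in CLOUD and not self.api_key:
            issues.append("cloud provider needs an API key")
        if not self.model:
            issues.append("model name is empty")
        return issues

    def masked_key(self) -> str:
        """Hide all but the last four characters of the key."""
        return "*" * max(len(self.api_key) - 4, 0) + self.api_key[-4:]
```

Masking matters here because a settings file on shared storage can be more exposed than app-private storage, so the key should at least never appear in full in logs or screenshots.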

What this means for Czech users

As technically impressive as X-OmniClaw sounds, for the average Czech user it is currently more of an experiment. It is not localized into Czech: the interface, the documentation, and the supported language models all work primarily in English, and in some cases Chinese. Voice control does not yet support Czech. Installation requires manually downloading the APK from GitHub (so-called sideloading), which carries security risks if you don't know what you're doing.

Nevertheless, it's an important signal. Open-source code means that anyone, including Czech developers, can modify X-OmniClaw, translate it into Czech, and connect it to local models. With the growing availability of open-source language models that support Slavic languages, the way is opening toward an autonomous assistant that truly understands Czech and runs fully on your phone.

Does X-OmniClaw require a constant internet connection?

Partially. Core perception and control operations run on-device, but for complex reasoning the agent calls a cloud language model. Technically, it's possible to connect it to a local model via Ollama and run completely offline, but performance will depend on your device's power and the quality of the local model.
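
The answer above implies a simple routing decision, sketched here under stated assumptions: steps that need no language-model reasoning stay on the device, reasoning goes to the cloud when a connection exists, and a local model serves as the offline fallback. The `Backend` names and the preference for the cloud when online are illustrative choices, not the app's documented behavior.

```python
from enum import Enum


class Backend(Enum):
    ON_DEVICE = "on-device"   # perception and control, no LLM call
    LOCAL_LLM = "local-llm"   # e.g. a model served locally through Ollama
    CLOUD_LLM = "cloud-llm"   # a hosted model reached over the network


def choose_backend(needs_reasoning: bool, online: bool,
                   has_local_model: bool) -> Backend:
    """Pick where a step runs; raises if reasoning is needed but no model is reachable."""
    if not needs_reasoning:
        return Backend.ON_DEVICE
    if online:
        return Backend.CLOUD_LLM
    if has_local_model:
        return Backend.LOCAL_LLM
    raise RuntimeError("reasoning requested, but no model is reachable")
```

A privacy-focused setup could flip the order and prefer the local model whenever one is configured, at the cost of speed and answer quality on weaker phones.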

Does X-OmniClaw support iPhones or other platforms?

No, X-OmniClaw is designed exclusively for Android 8.0 and higher. There is no version for iOS, due to the closed nature of the Apple ecosystem, which doesn't allow the same level of system permissions needed for autonomous app control.

Which model is best suited for X-OmniClaw?

The Oppo team recommends models with multimodal support (the ability to process images), such as Qwen 3.6 Flash via OpenRouter, Claude Opus 4 from Anthropic, or GPT-4.1 from OpenAI. For speech transcription, SenseVoice Small via SiliconFlow has proven effective.