The long era in which automation depended on the technological maturity of software, specifically on APIs (interfaces through which programs communicate with each other), is beginning to change. As part of its Copilot Studio platform, Microsoft has announced a "computer use" capability. The feature, currently in an early research preview, lets AI agents "see" and "control" the screen much as a human operator would.
How does AI that "sees" your screen work?
Traditional automation, known as RPA (Robotic Process Automation), was often brittle: if a button's color or position changed in an application, the process failed. Microsoft's new approach relies on the reasoning capabilities of AI models. Instead of looking for a specific pixel on the screen, the agent interprets the context of what it sees.
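The difference between the two approaches can be illustrated with a toy Python sketch. It is not Copilot Studio's implementation: the `Element` class and both functions are hypothetical, and a simple case-insensitive label match stands in for the vision model a real agent would use to interpret the screen.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical on-screen UI element: its visible label and position."""
    label: str
    x: int
    y: int


def click_by_coordinates(screen: List[Element], x: int, y: int) -> Optional[Element]:
    """Classic recorded-script style: target whatever sits at a fixed position.

    If the layout changes, this may return the wrong element or None.
    """
    for element in screen:
        if (element.x, element.y) == (x, y):
            return element
    return None


def click_by_meaning(screen: List[Element], intent: str) -> Optional[Element]:
    """Context-aware style: target the element whose label matches the intent.

    A real agent would use a vision-language model here; a case-insensitive
    substring match on the label is a deliberately simple stand-in.
    """
    for element in screen:
        if intent.lower() in element.label.lower():
            return element
    return None


old_layout = [Element("Save invoice", 100, 200), Element("Cancel", 200, 200)]
# After an application update, the buttons have moved.
new_layout = [Element("Cancel", 100, 200), Element("Save invoice", 300, 250)]

print(click_by_coordinates(old_layout, 100, 200).label)  # Save invoice
print(click_by_coordinates(new_layout, 100, 200).label)  # Cancel (wrong button)
print(click_by_meaning(new_layout, "save").label)        # Save invoice
```

After the update, the coordinate-based script silently targets "Cancel", while the label-based lookup still finds the right button.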
Thanks to the integration of advanced models, the agent can recognize in real time that, for example, an unexpected pop-up window has appeared, and try to close it so it can continue with the task. The process consists of:
- Perception: Analysis of the visual content of a browser (Edge, Chrome, Firefox) or desktop application.
- Planning: Breaking down a complex task (e.g., "Find an invoice in email and enter it into our accounting system") into individual steps.
- Action: Simulating mouse movement, clicking on elements, selecting from menus, and typing text into fields.
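The three stages above form a loop: observe, decide, act, repeat. The following Python sketch simulates that loop, including recovery from an unexpected pop-up. Everything in it is hypothetical: `FakeScreen` stands in for a real desktop, `plan()` returns a fixed list where a real agent would ask a language model to decompose the goal, and the step names are illustrative. It does not use any Copilot Studio API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeScreen:
    """Hypothetical stand-in for a desktop: tracks visible state and logs actions."""
    popup_after: int = 2  # a pop-up appears once, after this many actions (0 = never)
    popup_open: bool = False
    log: List[str] = field(default_factory=list)

    def perceive(self) -> Dict[str, bool]:
        # Perception: a real agent would analyse a screenshot with a vision model.
        return {"popup": self.popup_open}

    def act(self, action: str) -> None:
        # Action: a real agent would move the mouse, click, or type here.
        if action == "close_popup":
            self.popup_open = False
        self.log.append(action)
        # Simulate an unexpected dialog appearing mid-task.
        if len(self.log) == self.popup_after:
            self.popup_open = True


def plan(goal: str) -> List[str]:
    """Planning: break a goal into steps (fixed here; a model would do this in practice)."""
    if goal == "enter invoice":
        return ["open_email", "find_invoice", "open_accounting_app",
                "type_invoice_data", "click_save"]
    raise ValueError(f"no plan for goal: {goal!r}")


def run_agent(goal: str, screen: FakeScreen, max_iterations: int = 20) -> List[str]:
    """Perceive-plan-act loop; returns the log of actions performed."""
    steps = plan(goal)
    for _ in range(max_iterations):  # cap the loop so a stuck agent cannot run forever
        if not steps:
            break
        observation = screen.perceive()
        if observation["popup"]:
            screen.act("close_popup")  # react to the unexpected before continuing
            continue
        screen.act(steps.pop(0))  # execute the next planned step
    return screen.log


print(run_agent("enter invoice", FakeScreen()))
# ['open_email', 'find_invoice', 'close_popup', 'open_accounting_app', 'type_invoice_data', 'click_save']
```

The iteration cap is a simple safeguard: if the agent keeps encountering something it cannot resolve, the loop stops instead of running indefinitely.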
This shift is crucial for companies that still use older (legacy) systems that cannot connect through modern APIs. The agent becomes a "digital colleague" that learns to work with your old software just as you do.
Comparison with the competition: Who's leading the battle for desktop agents?
Microsoft is not the only player in the "computer use" field. In recent months, we have seen significant moves from other market leaders:
- Anthropic: Their Claude 3.5 Sonnet model introduced a very similar computer control capability, focused on a high degree of accuracy in the browser.
- OpenAI: The company is working on similar agent capabilities for its GPT models, which are expected to be deeply integrated into the Windows operating system.
- Google: Through Gemini, it is striving for similar integration into Google Workspace, but so far is more focused on working within cloud documents than on directly controlling the desktop OS.
Microsoft's advantage, however, is the ecosystem. Copilot Studio is not just a model; it is a platform that allows developers (makers) to build entire workflows that are directly connected to data in Microsoft 365 and Dynamics 365.
Practical impact: What does it mean for Czech companies?
For the Czech market, which is strong in services, manufacturing, and mid-sized companies, this innovation holds enormous potential. Many Czech companies still work with local, specific software for accounting, warehousing, or production management, which is often closed and lacks an API for easy integration with modern AI.
Use case examples:
- Accounting services: The agent can automatically transcribe data from PDF invoices directly into an older accounting program, where a human would otherwise have to manually retype numbers.
- Customer support: The agent can browse internal databases and customer portals in real time to find an answer for the operator.
- Administration: Automatic processing of e-shop orders that require manual confirmation in the website's admin panel.
From the perspective of EU regulation (the AI Act) and data protection, it is worth emphasizing that, according to Microsoft, data remains within the boundaries of the Microsoft Cloud and is not used to train foundation (frontier) models. This is a key factor for European companies, which must comply with strict GDPR requirements when adopting these tools.
Pricing and availability
Microsoft Copilot Studio is not a free tool. For enterprise customers, it is typically licensed as part of broader Microsoft 365 or Dynamics 365 agreements. Pricing policy: Microsoft generally offers Copilot Studio subscriptions within enterprise licenses, at prices ranging from tens to hundreds of US dollars per month per capacity unit or user; exact amounts depend on the type of Microsoft contract. Individual users of the personal Copilot versions have access to a limited free version, but the full "computer use" agent capability is aimed primarily at business customers.
Localization: Although primary development focuses on English, Microsoft is gradually expanding support for European languages. For the Czech market, it is crucial that the agent can reliably handle Czech text and Czech websites.
Is controlling a computer with AI safe? Could AI accidentally delete important data?
Microsoft says it applies robust security frameworks and governance, and agents in Copilot Studio operate in a controlled environment. However, because the feature is a research preview, users should expect the agent to make mistakes. It is advisable to start with tasks performed under human supervision (human-in-the-loop) before entrusting the agent with full control over critical systems.
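One common way to keep a human in the loop is an approval gate: the agent proposes actions, and anything classified as risky waits for explicit confirmation. The Python sketch below illustrates the idea; the `RISKY_ACTIONS` set, the action names, and the `approve` callback are hypothetical and are not part of Copilot Studio's governance features.

```python
from typing import Callable, List, Tuple

# Hypothetical policy: action types that always require human approval.
RISKY_ACTIONS = {"delete", "submit_payment", "overwrite"}


def execute_with_oversight(
    proposed: List[Tuple[str, str]],
    approve: Callable[[str, str], bool],
) -> List[str]:
    """Run proposed (action, target) pairs, asking a human to approve risky ones.

    Returns a log of what was executed or skipped; actual execution is
    simulated by the log entry.
    """
    log = []
    for action, target in proposed:
        if action in RISKY_ACTIONS and not approve(action, target):
            log.append(f"skipped {action} {target}")
            continue
        log.append(f"executed {action} {target}")
    return log


proposed = [("type", "invoice_total"), ("delete", "old_records"), ("click", "save")]

# A reviewer who rejects everything risky; interactively this could be
# lambda a, t: input(f"Allow {a} on {t}? [y/N] ").strip().lower() == "y"
print(execute_with_oversight(proposed, lambda action, target: False))
# ['executed type invoice_total', 'skipped delete old_records', 'executed click save']
```

Because `approve` is called only for risky actions, routine steps run without interruption; the policy can be loosened by shrinking the set as trust in the agent grows.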
Do I need to prepare each application specially so the AI can control it?
No, that is precisely the main advantage. Unlike traditional automation, where you must define every step and every field, the AI agent, thanks to the "computer use" capability, sees the interface and understands it through visual reasoning. You just need to give it a goal.
Will this feature work on macOS as well, or only on Windows?
The current announcement focuses on desktop applications and browsers commonly used in corporate environments, which primarily includes Windows and browsers such as Edge or Chrome. Detailed support for macOS depends on future Microsoft updates.