Anthropic’s Claude 3.5 Computer Use Framework (AI Agent)
The newly released Claude 3.5 Computer Use model marks a significant milestone: it is the first frontier AI model to offer computer use in public beta, operating through a graphical user interface (GUI) AI Agent.
The closed-source nature of most commercial software presents significant challenges, as AI Agents are often unable to access internal APIs or code.
As a result, research has increasingly focused on GUI-based AI Agents that interact with digital devices using human-like mouse and keyboard actions.
Systems such as WebGPT, Agent-Lumos, CogAgent, AutoWebGLM, Auto-GUI, AppAgent, ScreenAgent, and AssistGUI have shown enhanced performance across diverse tasks, ranging from web navigation to general GUI automation.
To improve the effectiveness of AI Agents with GUI tools, researchers have concentrated on creating systems capable of interpreting human intentions and predicting actions as function calls.
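As a rough illustration of what “actions as function calls” can look like, the sketch below shows a predicted action emitted as a structured call that a host application dispatches onto the GUI. The action names and fields here are hypothetical and are only meant to convey the idea, not any specific vendor’s schema.

```python
# Hypothetical schema: the model emits a structured function call and the host
# application maps it onto a concrete mouse/keyboard operation.
predicted_action = {
    "name": "left_click",
    "arguments": {"coordinate": [412, 308]},  # pixel position on the current screenshot
}


def dispatch(action):
    """Translate a predicted function call into a GUI operation."""
    if action["name"] == "left_click":
        x, y = action["arguments"]["coordinate"]
        print(f"clicking at ({x}, {y})")        # a real agent would drive the mouse here
    elif action["name"] == "type":
        print(f"typing: {action['arguments']['text']}")


dispatch(predicted_action)
```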
A Few General Observations
The initial design of AI Agent tools focused on accessing applications and data through APIs, requiring both API availability and custom integration for each interface.
However, many desktop commercial applications lack API support, creating a significant limitation.
The most effective alternative is leveraging the existing Graphical User Interface (GUI).
Users are accustomed to completing tasks through GUIs, so it is natural for them to expect AI Agents to operate within the same environment.
Granting AI Agents full access to the GUI gives them an unprecedented level of autonomy.
These agents are more than language models; they integrate vision capabilities within a robust framework comprising multiple components.
One crucial component is the AI Agent software, which orchestrates and controls the agent’s operations.
The framework also includes specialised tools, each designed for specific tasks and dynamically selected by the AI Agent as needed.
For example, the Computer Use tool focuses exclusively on GUI interaction, enabling seamless execution of tasks within a visual environment.
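To make this concrete, the sketch below declares the computer-use tool through Anthropic’s Python SDK. It follows the public beta documentation available at the time of writing (tool type `computer_20241022`, beta flag `computer-use-2024-10-22`); exact names and parameters may have changed since, so treat it as a sketch rather than a definitive reference.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The Computer Use tool is declared per request; the model then replies with
# tool_use blocks (screenshot, click, type, ...) that the host application
# must execute on the model's behalf.
COMPUTER_TOOL = {
    "type": "computer_20241022",      # beta tool type at the time of writing
    "name": "computer",
    "display_width_px": 1024,         # resolution of the screenshots we will send
    "display_height_px": 768,
    "display_number": 1,              # X11 display number, relevant on Linux
}

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[COMPUTER_TOOL],
    messages=[{"role": "user", "content": "Open the system settings and turn on dark mode."}],
    betas=["computer-use-2024-10-22"],
)
print(response.content)
```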
Introduction
Recent studies in GUI automation for AI Agents have leveraged general-purpose LLMs to interact with graphical user interfaces (GUIs) by understanding the GUI state and generating actions.
However, the release of Claude 3.5 Computer Use by Anthropic marks a significant advancement in this domain, introducing the first frontier AI model to offer computer use in public beta.
Unlike previous models, Claude 3.5 Computer Use offers an end-to-end solution: actions are generated directly from the user’s instruction and the purely visual GUI state it observes.
ReAct Nature of AI Agents
To isolate specific aspects of the model’s capability, the study rigorously evaluates the performance of API-based GUI automation models across three dimensions:
• Planning: Assessing the model’s ability to generate an executable plan from the user’s query. The plan should have a correct flow, allowing the task to be completed successfully in the software, with each step being clear and executable.
• Action: Evaluating whether the model can accurately ground interactable GUI elements and execute each action, step by step, from the derived plan.
• Critic: Measuring the model’s awareness of the changing environment, including its ability to adapt to the outcomes of its actions, such as retrying tasks if unsuccessful or terminating execution when the task is completed.
Claude Computer Use utilises a reasoning-acting (ReAct) paradigm to generate reliable actions in the dynamic GUI environment.
Observing the environment before deciding on an action ensures that its responses align with the current GUI state.
Additionally, it demonstrates the ability to efficiently recognise when user requirements are met, enabling decisive actions while avoiding unnecessary steps.
Unlike traditional approaches that rely on continuous observation at every step, Claude Computer Use employs a selective observation strategy, monitoring the GUI state only when needed. This method reduces costs and enhances efficiency by eliminating redundant observations.
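The loop below is a minimal sketch of that observe–reason–act cycle under the same assumptions as the earlier tool-declaration sketch. `take_screenshot()` and `execute_action()` are hypothetical host-side helpers standing in for the real mouse, keyboard, and display plumbing.

```python
def run_agent(client, computer_tool, task, max_steps=20):
    """Drive a ReAct-style loop: the model reasons, requests an action,
    and the host executes it and feeds the result (or a screenshot) back."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=[computer_tool],
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        tool_calls = [block for block in response.content if block.type == "tool_use"]
        if not tool_calls:
            return response                      # no further action requested: task considered done
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for call in tool_calls:
            if call.input.get("action") == "screenshot":
                # Selective observation: a screenshot is captured only when
                # the model explicitly asks for one, not after every step.
                output = take_screenshot()           # hypothetical helper
            else:
                output = execute_action(call.input)  # hypothetical helper
            results.append({"type": "tool_result", "tool_use_id": call.id, "content": output})
        messages.append({"role": "user", "content": results})
    return None                                  # step budget exhausted
```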
Visual Information
The Claude Computer Use AI Agent relies exclusively on visual input from real-time screenshots to observe the user environment, without utilising any additional data sources.
These screenshots, captured during task execution, allow the model to mimic human interactions with desktop interfaces effectively.
This approach is essential for adapting to the ever-changing nature of GUI environments. By adopting a vision-only methodology, Claude Computer Use enables general computer operation without depending on software APIs, making it especially suitable for working with closed-source software.
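A minimal sketch of what that vision-only observation can look like in practice: a screenshot is captured, base64-encoded, and returned as an image content block for the model to inspect. The use of Pillow’s `ImageGrab` is an assumption chosen for illustration; the actual framework may capture the screen differently.

```python
import base64
import io

from PIL import ImageGrab  # assumes Pillow on a desktop OS with screen access


def take_screenshot():
    """Capture the current screen and wrap it as an image content block."""
    image = ImageGrab.grab()                      # full-screen capture
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return [{
        "type": "image",
        "source": {"type": "base64", "media_type": "image/png", "data": encoded},
    }]
```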
Error Categorisation
Representative failure cases in the evaluation highlight instances where the model’s actions did not align with the intended user outcomes, exposing limitations in task comprehension or execution.
To analyse these failures systematically, errors are categorised into three distinct sources: Planning Error (PE), Action Error (AE) and Critic Error (CE).
These categories aid in identifying the root causes of each failure.
Planning Error (PE): These errors arise when the model generates an incorrect plan based on task queries, often due to misinterpreting task instructions or misunderstanding the current computer state. For instance, a planning error might occur during tasks like subscribing to a sports service.
Action Error (AE): These occur when the plan is correct, but the agent fails to execute the corresponding actions accurately. Such errors are typically related to challenges in understanding the interface, recognising spatial elements, or controlling GUI elements precisely. For example, errors can arise during tasks like inserting a sum equation over specific cells in a spreadsheet.
Critic Error (CE): Critic errors happen when the agent misjudges its own actions or the computer’s state, resulting in incorrect feedback about task completion. Examples include updating details on a resume template or inserting a numbering symbol.
These categorisations provide a structured approach to identifying and addressing the underlying causes of task failures.
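For benchmarking purposes, the three error sources can be recorded per failed step with a simple data structure. The sketch below is hypothetical and not part of the published framework; it only shows one way to log the taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorSource(Enum):
    PLANNING = "PE"   # wrong or infeasible plan derived from the task query
    ACTION = "AE"     # correct plan, but grounding/execution of a step failed
    CRITIC = "CE"     # misjudged its own actions or the resulting computer state


@dataclass
class FailureRecord:
    task: str
    step: int
    source: ErrorSource
    note: str = ""


failure = FailureRecord(
    task="Insert a sum equation over specific cells",
    step=3,
    source=ErrorSource.ACTION,
    note="clicked the wrong cell before typing the formula",
)
```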
Conclusion
This study highlights the framework’s potential as well as its limitations, particularly in planning, action execution, and critic feedback.
An out-of-the-box framework, Computer Use Out-of-the-Box, was introduced to simplify the deployment and benchmarking of such models in real-world scenarios. In an upcoming article I would like to dig deeper into this framework.