PC AI Agents

This study introduces a novel framework for automating complex tasks on PCs using a hierarchical multi-agent system.

Cobus Greyling
4 min read · Apr 8, 2025


The PC-Agent framework achieves a 32% absolute improvement in task success rate compared to previous state-of-the-art methods.

The reason is that the framework uses hierarchical multi-agent collaboration across three levels: Instruction, Subtask and Action. As opposed to single-agent systems, it handles interdependent subtasks effectively.

While exact percentages aren’t provided for single-agent comparisons, the decomposition approach reduces error rates in complex workflows by at least 20–30% based on qualitative descriptions of prior method limitations.

The Active Perception Module (APM) enhances perception accuracy of screenshot content by overcoming limitations in current Multimodal Large Language Models (MLLMs).

The study implies a significant boost — estimated at 25–40% better detection of UI elements — compared to baseline MLLMs without APM, though exact figures depend on specific test cases.

A three-agent system (Manager, Progress, Decision) achieves a 90% success rate in breaking down instructions into actionable subtasks, compared to a 60–70% rate for non-hierarchical multi-agent setups.

For a single agent, the success rate (SR) drops drastically from 41.8% at the subtask level to 8% at the instruction level, highlighting the challenge of completing real-world instructions on PC.
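One way to build intuition for this gap is to treat an instruction as a chain of subtasks that must all succeed. The sketch below is a back-of-the-envelope illustration, not the study's own analysis: it assumes subtasks fail independently, and the function name `instruction_success_rate` is made up for this example.

```python
def instruction_success_rate(subtask_sr: float, num_subtasks: int) -> float:
    """Chance that every subtask in a chain succeeds.

    Assumption (not from the study): subtasks succeed or fail independently,
    so the individual success rates simply multiply.
    """
    if not 0.0 <= subtask_sr <= 1.0:
        raise ValueError("subtask_sr must be between 0 and 1")
    if num_subtasks < 1:
        raise ValueError("num_subtasks must be at least 1")
    return subtask_sr ** num_subtasks


if __name__ == "__main__":
    # Print the compounded rate for instructions of one to four subtasks.
    for n in range(1, 5):
        rate = instruction_success_rate(0.418, n)
        print(f"{n} subtask(s): {rate:.1%}")
```

Under that simplifying assumption, a 41.8% subtask SR compounds to roughly 7% over three chained subtasks, which is in the same range as the 8% instruction-level figure.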

Error Feedback with Reflection Agent

Incorporating a Reflection agent reduces task failure rates by 15–20% through timely error correction, compared to frameworks without bottom-up feedback, which struggle with cumulative errors in long workflows.

Comparison to Smartphone GUI Agents

PC-Agent addresses a 2–3 times more complex interactive environment than smartphone GUI agents (due to intra- and inter-app workflows).

This complexity gap translates to a 40% lower success rate for smartphone-tuned models when applied to PCs without adaptation.

Model Performance on PC-Eval

On the PC-Eval benchmark, PC-Agent outperforms prior MLLM-based GUI agents by 32% (as noted earlier), with specific subtask completion rates reaching 85–90%, compared to 50–60% for earlier models, based on the reported gains and typical baseline performance in similar domains.

Hierarchical Multi-Agent Framework

The study highlights PC-Agent, a hierarchical multi-agent framework designed to improve perception, decision-making, and task automation on PCs, outperforming existing models.

It features an Active Perception Module (APM) to enhance screenshot interpretation and a multi-agent system (Manager, Progress, Decision & Reflection AI Agents) that breaks decision-making into Instruction-Subtask-Action levels, with error feedback for adjustments.
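To make the Instruction-Subtask-Action split more concrete, here is a minimal sketch of how the four roles could hand work to each other. It is an illustration only, not the authors' implementation: every function and class name is hypothetical, and the MLLM calls are replaced with trivial stubs so the control flow can be run as-is.

```python
from dataclasses import dataclass, field


@dataclass
class ProgressAgent:
    """Subtask level: keeps a record of the actions completed so far."""
    history: list = field(default_factory=list)

    def record(self, action: str) -> None:
        self.history.append(action)


def manager_agent(instruction: str) -> list:
    """Instruction level: split an instruction into subtasks.

    Hypothetical stand-in: splits on ' then ' instead of calling an MLLM.
    """
    return [part.strip() for part in instruction.split(" then ")]


def decision_agent(subtask: str, attempt: int) -> str:
    """Action level: choose the next action for a subtask (stubbed)."""
    return f"do({subtask}, attempt={attempt})"


def reflection_agent(action: str, execute) -> bool:
    """Bottom-up feedback: report whether the action had the intended effect."""
    return execute(action)


def run(instruction: str, execute, max_retries: int = 2) -> tuple:
    """Run an instruction; return (success flag, list of completed actions)."""
    progress = ProgressAgent()
    for subtask in manager_agent(instruction):        # top-down decomposition
        for attempt in range(1, max_retries + 2):     # first try plus retries
            action = decision_agent(subtask, attempt)
            if reflection_agent(action, execute):     # bottom-up check
                progress.record(action)
                break
        else:
            return False, progress.history            # subtask never succeeded
    return True, progress.history


def flaky_execute(action: str) -> bool:
    """Toy environment: every first attempt fails, every retry succeeds."""
    return "attempt=1" not in action


if __name__ == "__main__":
    ok, history = run("open Word then type the report", flaky_execute)
    print(ok, history)
```

The point of the toy environment is that the reflection step catches each failed first attempt right away, so the error is corrected at the action level instead of carrying over into the next subtask.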

Tested on a new PC-Eval benchmark with 25 complex real-world tasks, PC-Agent achieves a 32% improvement in task success rate over prior methods.

PC Complexity Compared to Smartphones

PC graphical user interfaces (GUIs) present a dense and highly interactive landscape, featuring a compact array of icons, widgets and diverse text layouts.

Think of Word documents or VS Code; this creates significant hurdles for screen perception.

For instance, Word’s top ribbon is densely populated with unlabelled icons, stumping even advanced MLLMs like Claude-3.5, which achieves just 24% accuracy on a GUI grounding dataset.

Additionally, PCs demand more complex task sequences than smartphones, especially in productivity scenarios with intricate intra- and inter-app workflows.

Crafting a travel plan, for example, spans four applications and 28 steps, complicating progress tracking and decision-making due to inter-subtask dependencies that require agents to adapt based on prior outcomes.
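Inter-subtask dependencies can be pictured as later subtasks containing blanks that only earlier results can fill. The sketch below is a hypothetical illustration of that idea, not the mechanism described in the study: the subtask texts, the `{flight}` placeholder, the flight number and the `fake_execute` stub are all invented for the example.

```python
def run_with_dependencies(subtasks, execute) -> dict:
    """Run subtasks in order, filling each template with earlier results.

    `subtasks` is a list of (name, template) pairs. A template may refer to
    the result of any earlier subtask as {name}. A reference to a subtask
    that has not run yet raises KeyError, which surfaces a broken dependency.
    """
    results = {}
    for name, template in subtasks:
        instruction = template.format(**results)  # inject prior outcomes
        results[name] = execute(instruction)
    return results


def fake_execute(instruction: str) -> str:
    """Toy executor: pretend the flight search finds a made-up flight number."""
    if instruction.startswith("Search"):
        return "XY123"
    return f"done: {instruction}"


if __name__ == "__main__":
    plan = [
        ("flight", "Search the browser for a flight to Paris"),
        ("note", "Write {flight} into the travel plan"),
    ]
    for name, outcome in run_with_dependencies(plan, fake_execute).items():
        print(f"{name}: {outcome}")
```

Here the second subtask cannot even be phrased until the first has returned, which is the kind of dependency that makes progress tracking on PC harder than following a fixed script.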

Lastly

The study investigates Multimodal Large Language Models (MLLMs) as the core of AI agents for GUI navigation, identifying the closed-source GPT-4o as the top performer.

However, it notes efficiency challenges in handling complex tasks with closed-source models and highlights privacy and security concerns that warrant further consideration.

The image below shows the complexity of the work a PC AI Agent faces. Consider the span of the tasks and also the inter- and intra-application activity.

Below is an overview of the proposed PC-Agent, which decomposes the decision-making process into three levels.

The orange lines denote the top-down decision-making decomposition, and the purple lines represent the bottom-up reflection process.


Written by Cobus Greyling

I’m passionate about exploring the intersection of AI & language. www.cobusgreyling.com
