The Critical Gap For AI Agents: From Simple Task Automation to Complex Work Completion

There is a significant gap between the capabilities of AI Agent Computer Interfaces and human-level performance. Acknowledging this gap is the first step towards developing effective ways to bridge it.

5 min read · Jan 21, 2025

Introduction

Human cognition transfer offers a promising way to bridge the gap between AI capabilities and human-level performance.

By replicating the complex cognitive processes humans use to understand, analyse, and solve problems, AI Agents can improve their decision-making and adaptability in real-world scenarios.

Transferring human cognition to AI Agents allows them to maintain contextual awareness over extended interactions, make dynamic decisions in changing environments and refine strategies based on outcomes.

This approach not only enhances the efficiency and accuracy of AI Agents but also enables them to tackle complex, multifaceted tasks that were once limited to human expertise.

GUIs Are Human APIs

AI Agents have evolved from being integrated through APIs and backend systems to interacting directly with the graphical user interface (GUI), offering significant advantages in terms of integration and emulating human behaviour.

This shift enables agents to perform tasks within a more natural, human-like workflow, leveraging the GUI as the API of human-computer interaction, rather than running parallel processes.

However, despite this progress, current AI Agents still significantly underperform humans in complex computer tasks (Anthropic, 2024).

While they can handle simple tasks like web searches and file copying, they struggle with more comprehensive, real-world tasks such as video editing, presentation creation, and report generation.

These tasks demand sustained operation across multiple applications, sophisticated decision-making, and even human-level aesthetic judgment.

Another obstacle that has surfaced in a number of studies is the disruptive effect of pop-ups and unexpected GUI elements.

True digital AI Agents must be capable of handling these complex workflows to meaningfully reduce human workload and increase efficiency.

Experiences Shared By Anthropic

Anthropic stated that there is still a significant gap to bridge before AI Agents like Claude ACI can match human capabilities in computer use.

Currently, Claude’s interactions with computers are slow and prone to errors.

Many routine human actions, such as dragging or zooming, are beyond Claude’s abilities. The flipbook approach, where Claude pieces together screenshots rather than observing a continuous video stream, results in missing transient actions or notifications.
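To make the flipbook limitation concrete, here is a minimal sketch of why periodic screenshots can miss short-lived GUI events. The event names, timings and the two-second capture interval are illustrative assumptions, not figures from Anthropic.

```python
from dataclasses import dataclass


@dataclass
class ScreenEvent:
    """A GUI element that is visible for a limited time (hypothetical example data)."""
    name: str
    start: float  # second at which the element appears
    end: float    # second at which it disappears


def visible_in_snapshots(event: ScreenEvent, interval: float, duration: float) -> bool:
    """Return True if at least one periodic screenshot is taken while the event is on screen."""
    steps = int(duration / interval)
    for i in range(steps + 1):
        t = i * interval  # time of the i-th screenshot
        if event.start <= t < event.end:
            return True
    return False


events = [
    ScreenEvent("save dialog", start=1.0, end=6.0),
    ScreenEvent("toast notification", start=2.2, end=2.8),
]

# Screenshots at 0, 2, 4, 6, 8 and 10 seconds (assumed interval).
missed = [e.name for e in events
          if not visible_in_snapshots(e, interval=2.0, duration=10.0)]
print(missed)
```

In this toy setup the long-lived dialog is captured, while the notification that appears and disappears between two screenshots is never seen. Sampling more often narrows the window but adds to the cost and latency of every step.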

Even during the launch demonstrations, Claude made some amusing yet critical mistakes — accidentally stopping a screen recording, losing all footage, or veering off from a coding demo to browse photos of Yellowstone National Park. These incidents highlight the challenges ahead.

For Claude to truly match human performance, improvements are needed in speed, reliability, and versatility. This includes enhancing its ability to seamlessly handle complex tasks and spontaneous events. As computer use capabilities advance, they must also become more accessible to users with limited technical skills.

The Claude AI Agent Computer Interface (ACI) performance sits at 14% of that of humans.

The graph below from TheAgentFactory gives an indication of where AI Agents sit in terms of cost, steps and success rate. Notice how the success rate sits around 20%.

The Critical Gap: From Simple Task Automation to Complex Work Completion

According to the PC Agent study, most existing approaches to building digital AI Agents rely heavily on proprietary large language model (LLM) APIs.

While some efforts have been made to train models for computer operation, these have primarily focused on improving basic visual grounding capabilities or addressing specific domains like web interaction.

However, these methods still struggle to handle complex, real-world computer work.

Two major challenges have been identified:

Foundational Visual Grounding: Current vision-language models still face difficulty precisely locating GUI elements, such as the Start button in the taskbar.

Complex Cognitive Understanding: More importantly, current agents lack the cognitive abilities needed for complex tasks. They struggle with maintaining context over extended interactions, making dynamic decisions in changing environments, and adapting strategies based on real-time outcomes.
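The first challenge can be made measurable with a simple check: does the point a model chooses to click fall inside the target element? The sketch below is one assumed way to score this, with made-up coordinates for a Start button. It is not the evaluation method used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of a GUI element in screen pixels (illustrative values only)."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def grounding_accuracy(predictions, targets) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if not targets or len(predictions) != len(targets):
        raise ValueError("need one prediction per target")
    hits = sum(1 for (x, y), box in zip(predictions, targets) if box.contains(x, y))
    return hits / len(targets)


# Hypothetical Start button in the corner of a 1920x1080 screen.
start_button = Box(left=0, top=1040, width=48, height=40)

# Two hypothetical model outputs: one on the button, one just to its right.
predicted_clicks = [(24, 1060), (60, 1060)]
accuracy = grounding_accuracy(predicted_clicks, [start_button, start_button])
print(accuracy)
```

A miss of only a few pixels counts as a failure here, which is why small taskbar icons are a demanding test for vision-language models.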

While techniques like prompt engineering provide partial solutions, they fall short when it comes to complex, dynamic work environments.

The path from executing simple tasks to handling complex work lies in efficiently capturing and learning from human cognitive processes during computer use. The study's framework approaches this with three components:

PC Tracker: A lightweight infrastructure designed to efficiently gather high-quality human-computer interaction trajectories, capturing the full cognitive context.

Two-Stage Cognition Completion Pipeline: A process that transforms raw interaction data into detailed cognitive trajectories by enriching it with action semantics and thought processes.

Multi-Agent System: This system integrates a planning agent for decision-making with a grounding agent to ensure robust visual grounding.
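The division of labour in the multi-agent system can be sketched as a loop in which one agent decides what to do and another works out where on screen to do it. In the sketch below every class and method name is hypothetical, and a fixed script and a lookup table stand in for the models a real system would call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    action: str                   # "click" or "done"
    target: Optional[str] = None  # description of the element to act on


class PlanningAgent:
    """Decides the next action. A fixed script stands in for a language model."""

    def __init__(self, script):
        self._script = list(script)

    def next_decision(self, history) -> Decision:
        if len(history) < len(self._script):
            return Decision("click", self._script[len(history)])
        return Decision("done")


class GroundingAgent:
    """Maps an element description to coordinates. A dict stands in for a vision model."""

    def __init__(self, elements):
        self._elements = elements

    def locate(self, description):
        return self._elements.get(description)  # None if the element is not found


def run(planner, grounder, max_steps=10):
    """Alternate planning and grounding until the planner is done or grounding fails."""
    history = []
    for _ in range(max_steps):
        decision = planner.next_decision(history)
        if decision.action == "done":
            break
        point = grounder.locate(decision.target)
        history.append((decision.target, point))
        if point is None:
            break  # stop rather than click at an unknown position
    return history


grounder = GroundingAgent({"Start button": (24, 1060), "Settings icon": (24, 980)})
trace = run(PlanningAgent(["Start button", "Settings icon"]), grounder)
print(trace)
```

Keeping the two roles apart means the planner never has to reason about pixels, and a grounding failure is surfaced explicitly instead of turning into a misplaced click.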

Human Cognition: The Missing Key

The key to overcoming this gap lies in transferring human cognition to AI Agents.

When humans engage in complex work, their brains carry out sophisticated cognitive processes: understanding objectives, analysing current states, reflecting on past actions, and planning future strategies.

These cognitive processes lead to decisions, which are then externalised into observable actions.

Recognising this, a novel framework has been developed that captures and transfers human cognitive processes into AI agents, enabling them to handle more complex tasks with greater accuracy and adaptability.
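One way to picture the capture-and-transfer idea is as a two-stage enrichment of recorded actions: first describe what each raw action means, then attach the reasoning behind it. The sketch below uses a hypothetical schema, and two simple functions stand in for the models that would write the descriptions and thoughts. It is an illustration under those assumptions, not the study's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawStep:
    """One recorded user action (hypothetical schema)."""
    action: str                 # e.g. "click" or "type"
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


@dataclass
class CognitiveStep:
    raw: RawStep
    semantics: str = ""  # filled in by stage 1
    thought: str = ""    # filled in by stage 2


def complete_semantics(step: RawStep, describe_point) -> CognitiveStep:
    """Stage 1: replace raw coordinates with a description of what was acted on."""
    if step.action == "click":
        semantics = f"click {describe_point(step.x, step.y)}"
    elif step.action == "type":
        semantics = f"type '{step.text}'"
    else:
        semantics = step.action
    return CognitiveStep(raw=step, semantics=semantics)


def complete_thoughts(task: str, steps, explain):
    """Stage 2: attach a rationale to each step, given the task and the steps before it."""
    history = []
    for step in steps:
        step.thought = explain(task, list(history), step.semantics)
        history.append(step.semantics)
    return steps


# Stand-ins for model calls (assumptions for illustration only).
def describe_point(x, y):
    return "the Start button" if y >= 1040 else "the document area"


def explain(task, history, semantics):
    return f"Step {len(history) + 1} toward '{task}': {semantics}"


raw_steps = [RawStep("click", x=24, y=1060), RawStep("type", text="report")]
stage_one = [complete_semantics(s, describe_point) for s in raw_steps]
trajectory = complete_thoughts("open the report", stage_one, explain)
for step in trajectory:
    print(step.semantics, "|", step.thought)
```

The enriched trajectory pairs each observable action with the intent behind it, which is the kind of data an agent would need in order to learn the reasoning as well as the clicks.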

Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.

COBUS GREYLING

www.cobusgreyling.com

PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

Imagine a world where AI can handle your work while you sleep - organizing your research materials, drafting a report…

arxiv.org