AI Agents Computer Interface (ACI)

A new class of AI Agents is emerging with the capability to understand and navigate a graphical computer interface like a human would.

4 min read · Dec 4, 2024


Introduction

Recent advances in Foundation Models, especially Large Language Models (LLMs) and Multimodal Language Models (MLMs), have enabled AI Agents to complete complex tasks.

Some of these AI Agents have vision capabilities and make use of MLMs to interpret and interact with Graphical User Interfaces (GUIs), emulating how a human would interact with a GUI: performing actions like clicking and typing to fulfil user requests.

This study reviews and maps the progress in AI Agent Computer Interfaces (ACI), focusing on innovations in data, frameworks and applications.

Graphical User Interfaces (GUIs)

Because Graphical User Interfaces (GUIs) act as the primary interaction points between humans and digital devices, it makes sense to have an AI Agent emulate the human user.

There is no need for APIs or additional application integration, as the user interface already exists.

Until recently, the idea of an AI Agent interacting with a GUI like a human seemed unimaginable.

This breakthrough became possible due to two key advancements:

  1. The development of AI Agents capable of reasoning and taking actions.
  2. The integration of vision capabilities into Language Models, enabling them to analyse screenshots and gain the visual context needed for symbolic reasoning and effective interaction with graphical interfaces.
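In practice, the vision step boils down to packaging a screenshot together with an instruction and sending it to a multimodal model. The sketch below shows only that packaging; the payload shape and field names are illustrative assumptions, not any vendor's actual API.

```python
import base64
import json

def build_vision_request(screenshot_path: str, instruction: str) -> str:
    """Bundle a screenshot and a user instruction into a JSON payload
    for a multimodal model. Field names are illustrative, not a real API."""
    with open(screenshot_path, "rb") as f:
        # Images are typically sent base64-encoded inside a JSON body.
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({
        "instruction": instruction,
        "image_base64": image_b64,
    })
```

A real agent would send this payload to an (M)LLM endpoint and parse the model's description of on-screen elements out of the response.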

Framework

The matrix below shows the development of text-only and text-and-vision AI Agents over time.

This matrix is a very good starting point to explore the existing ACI frameworks. It is important to note that an AI Agent is not a Language Model per se.

AI Agents have one or more Language Models as their backbone, and in instances where the agent needs vision, the underlying model must provide that capability.

An AI Agent is a framework which acts as an extension to the model. The image below shows the basic architecture of the Anthropic Computer Use AI Agent.

AI Agent Architecture

I personally like the term AI Agents Computer Interface (ACI), however the study uses the term GUI Agents…

In principle, GUI Agents are designed to automatically control devices to complete tasks specified by the user.

They process the user’s query and the device’s current UI status, then perform a series of human-like actions to achieve the desired outcome.

These agents typically consist of five key components:

  1. GUI Perceiver: analyses the device's interface.
  2. Task Planner: breaks down the task into actionable steps.
  3. Decision Maker: chooses the best actions to take.
  4. Memory Retriever: accesses relevant past interactions or data.
  5. Executor: carries out the planned operations on the device.
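The five components above can be sketched as a single agent loop: perceive the UI, plan steps, decide on each action, execute it, and record it in memory. This is a minimal skeleton of that structure; every method body is a placeholder, not a real implementation.

```python
from dataclasses import dataclass, field

@dataclass
class GUIAgent:
    """Skeleton of the five-component GUI agent described above.
    Method bodies are illustrative stand-ins."""
    memory: list = field(default_factory=list)

    def perceive(self, screenshot):      # GUI Perceiver
        return {"elements": ["search_box", "submit_button"]}

    def plan(self, query, ui_state):     # Task Planner
        return [("type", "search_box", query), ("click", "submit_button")]

    def decide(self, candidates, ui_state):  # Decision Maker
        return candidates[0]             # trivially pick the first option

    def retrieve(self, query):           # Memory Retriever
        return [m for m in self.memory if query in m]

    def execute(self, action):           # Executor
        self.memory.append(f"did:{action[0]}")
        return action

    def run(self, query, screenshot):
        ui_state = self.perceive(screenshot)
        steps = self.plan(query, ui_state)
        return [self.execute(self.decide([s], ui_state)) for s in steps]
```

In a real agent each component would be backed by an (M)LLM call and a device controller, but the control flow stays the same.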

Variations of this structure exist. For example, one framework includes specialised agents for planning, decision-making, and reflecting to navigate mobile device operations effectively.

Finally

Handling complex, multi-step tasks across diverse GUIs is particularly challenging due to the variability and dynamic nature of interfaces.

Additionally, inference efficiency is a critical factor.

Humans are highly sensitive to response times, with delays under 200 milliseconds generally considered acceptable, while longer delays can quickly degrade the user experience.

Current GUI agents often face inference and communication delays measured in seconds, significantly impacting user satisfaction. Minimising these delays or enabling (M)LLMs to run directly on mobile devices is an urgent challenge that must be addressed.
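One practical starting point is simply instrumenting each inference call against that ~200 millisecond budget. The snippet below times a stand-in for a model call; `fake_inference` is a placeholder assumption, since real agent inference today often runs in seconds rather than milliseconds.

```python
import time

LATENCY_BUDGET_S = 0.200  # the ~200 ms interactive threshold mentioned above

def timed_call(fn, *args):
    """Run fn, returning its result, the elapsed time,
    and whether the call fit the interactive-latency budget."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= LATENCY_BUDGET_S

def fake_inference(prompt):
    # Stand-in for an (M)LLM call; sleeps 10 ms to simulate work.
    time.sleep(0.01)
    return f"action for: {prompt}"

result, elapsed, within_budget = timed_call(fake_inference, "click submit")
```

Logging `within_budget` per step makes it easy to see which stage of the agent loop is blowing the interaction budget.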

Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.

Written by Cobus Greyling
I'm passionate about exploring the intersection of AI & language. www.cobusgreyling.com