The Hard Truth About AI Agents & Accuracy
The length of tasks (measured by how long they take human professionals) that generalist autonomous frontier model agents can complete with 80% reliability has been doubling approximately every 213 days…
With all the excitement around AI Agents, two areas are being neglected…safety and accuracy. In this article I want to take a look at the state of AI Agent accuracy.
Introduction
How accurate are AI Agents? Well, it depends…there are a number of things to consider…
The Tasks
- Are the tasks more general or specific?
- How long would it take a human specialist to complete?
- How complex is the task? How many steps are involved?
- Which tools must the AI make use of? Does the task involve a web browser, or navigating an operating system?
- As you will see later in this article, there are a number of benchmarks which AI Agents and models are currently tested against. Overfitting to these benchmarks in order to score high is a reality. I was really surprised when I saw the scores of OpenAI’s Computer-Using Agent compared to other Agents.
Compare the benchmarking below from OpenAI…
To the benchmarking from the Agent S2 research…
AI Agent Definition
- Complexity needs to reside somewhere. Language Models are growing in capability to accommodate functionality like function calling, computer vision and other tool integrations. Think of OpenAI’s Agents SDK and their Computer-Using Agent.
- Language models are growing in capabilities of task decomposition, reasoning, grounding, web search, information retrieval and synthesis.
- But in the long run, organisations will make use of frameworks which decompose functionality and follow a more granular approach (a minimal sketch of this shape follows after this list).
- Good open-source examples are LangChain and LlamaIndex.
- For more enterprise or domain specific knowledge work, a more flexible and granularly controllable framework would be required.
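To make that concrete, below is a minimal sketch of what a granular, framework-owned agent loop can look like. It assumes nothing about any real library’s API; every name here (Tool, run_agent, plan) is illustrative, not taken from LangChain, LlamaIndex or any other framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Tool:
    name: str
    run: Callable[[str], str]  # e.g. a web search, a file read, an OS action

def run_agent(task: str,
              plan: Callable[[str, list], Optional[Tuple[str, str]]],
              tools: dict,
              max_steps: int = 10) -> list:
    """plan() inspects the task plus history and returns (tool_name, tool_input),
    or None when it judges the task complete. In practice plan() wraps an LLM call."""
    history = []
    for _ in range(max_steps):  # the step budget lives in the framework, not the model
        decision = plan(task, history)
        if decision is None:
            break
        tool_name, tool_input = decision
        observation = tools[tool_name].run(tool_input)  # each step is observable
        history.append((tool_name, tool_input, observation))
    return history
```

The point is where the complexity lives: step budgets, tool routing and stopping criteria sit in inspectable framework code rather than inside the model.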
Accuracy & Supervision
- Task completion accuracy and success are important, but a neglected aspect, I believe, is human supervision.
- There is an element where AI Agents can operate under weak human supervision, and a minor deviation can be corrected or adjusted by the human.
- And then there is cost; as you will see later in this article, there is a balance to strike between accuracy, oversight and spend. One light-supervision pattern is sketched below.
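As a sketch of the weak-supervision idea mentioned above: actions the agent is confident about run autonomously, while low-confidence steps are routed to a human for a quick approval. All names and the 0.8 threshold are hypothetical, not any product’s API.

```python
# Confidence-gated human-in-the-loop execution (illustrative only).
def execute_with_oversight(actions, threshold: float = 0.8) -> list:
    """actions: list of (description, agent_confidence) pairs.
    Confident steps run autonomously; uncertain ones wait for a human."""
    log = []
    for description, confidence in actions:
        if confidence < threshold:
            answer = input(f"Approve '{description}'? [y/n] ")  # the human checkpoint
            if answer.strip().lower() != "y":
                log.append(f"skipped: {description}")
                continue
        log.append(f"done: {description}")
    return log

plan = [("open browser", 0.95), ("submit expense claim", 0.55)]
print(execute_with_oversight(plan))
```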
Speech Recognition Trajectory
Taking a look back at the recent past of technology…
Accurate Speech Recognition is something we take for granted now, especially with voice-enabled apps like Grok, ChatGPT and others. Speech recognition is embedded, baked in, and we give it no consideration.
Considering the image below, I can remember in 2017 when it was illustrated how the word accuracy rate of Google’s machine-learning speech recognition exceeded human level, with the human accuracy threshold at 95% and Google ASR reaching 95%+.
It took a few years, but ASR got to the level it needed to reach.
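For reference, the word accuracy rate quoted above is simply one minus the word error rate (WER), which is computed from the word-level edit distance between the reference transcript and the recogniser’s output. A minimal sketch:

```python
# Word error rate via word-level Levenshtein distance:
# WER = (substitutions + insertions + deletions) / reference length,
# and word accuracy = 1 - WER, so the 95% threshold means WER = 0.05.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn the lights on", "turn the light on"))  # 0.25: one error in four words
```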
Research from the Agent S2 Project
Considering the image below, even the top performer, Simular’s Agent S2, is only hitting a 34.5% success rate when given 50 steps. That means it still fails almost two-thirds of the time.
You can also see a pretty big gap between specialised agents like Agent S2 and more general AI assistants like Claude. It’s clear that those general models have a long way to go if they want to get good at handling computer tasks.
Now, also look at how all the lines shoot up as the number of steps increases — more steps definitely help these agents get better results, but it also shows they’re not very efficient. They often need a bunch of tries to get things right, and that means more time and higher costs, which isn’t ideal.
When you limit them to just 15 steps, though, they all struggle — success rates are down around 15%. It really highlights how tough it is for them to handle complex, multi-step tasks without a lot of room to work.
And even with more steps, they’re still topping out in the mid-30% range.
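A quick back-of-envelope calculation shows why these numbers hurt: if failed attempts are retried, the expected number of attempts is the reciprocal of the success rate, so low accuracy multiplies the effective cost. The success rates below are the ones quoted above; the per-step cost is a purely hypothetical placeholder.

```python
# Back-of-envelope: retrying failed attempts means the expected number of
# attempts per completed task is 1 / success_rate, which multiplies cost.
def expected_cost_per_success(steps: int, success_rate: float,
                              cost_per_step: float) -> float:
    expected_attempts = 1.0 / success_rate  # geometric expectation of retries
    return expected_attempts * steps * cost_per_step

# Success rates quoted above; $0.02 per step is a hypothetical placeholder.
print(expected_cost_per_success(15, 0.15, 0.02))   # ~$2.00 per completed task
print(expected_cost_per_success(50, 0.345, 0.02))  # ~$2.90 per completed task
```

So even though the 50-step budget more than doubles the success rate, the cost per completed task barely improves.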
Compared to humans, who’d probably nail these tasks close to 95% of the time, these AI agents have a ton of catching up to do before they can reliably take over our everyday computer work. It’s a bit of a wake-up call, don’t you think?
Humans Against AI Agents
A new way to measure AI capabilities was introduced by researchers at METR: comparing AI to human performance.
The key idea is the 50%-task-completion time horizon, which is how long it takes humans to do tasks that an AI can complete with 50% success.
They tested this with experts doing tasks from various benchmarks, including 66 new short tasks.
Current top AI models, like Claude 3.7 Sonnet, match human performance on tasks that take humans about 50 minutes.
Since 2019, AI has been improving fast — doubling its time horizon every seven months, possibly even faster in 2024.
This improvement comes from better reliability, mistake correction, reasoning, and tool use.
However, the study notes limitations, like whether these results apply to real-world tasks, and raises concerns about AI autonomy leading to risky capabilities.
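To illustrate how such a horizon can be estimated (the study’s actual methodology differs in detail), one can fit a logistic curve of success probability against log task length and solve for the 50% crossing. The outcome data below is invented for illustration.

```python
import numpy as np

# Sketch: estimate a 50% time horizon by fitting a logistic curve of
# success probability vs. log2(task length), then solving for p = 0.5.
# Outcomes below are invented toy data, not METR's measurements.
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
successes    = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=float)

x = np.log2(task_minutes)
a, b = 0.0, 0.0  # logistic parameters: p = sigmoid(a + b * x)
for _ in range(10_000):  # plain gradient ascent on the log-likelihood
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    a += 0.01 * np.sum(successes - p)
    b += 0.01 * np.sum((successes - p) * x)

horizon = 2.0 ** (-a / b)  # task length where the fitted curve crosses 50%
print(f"estimated 50% horizon ≈ {horizon:.0f} minutes")  # roughly an hour here
```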
In 2019, with GPT-2, AI matched human performance on tasks that took humans about 2 seconds.
By 2023, with GPT-4-0314, this jumped to tasks taking humans around 8 minutes.
By early 2025, with Claude 3.7 Sonnet, AI reached tasks that take humans about 50 minutes.
The trend shows AI’s capability doubling every 7 months (as indicated by the “Doubling time” note).
Below is a graph for 80% reliability…
If this trend continues, in five years (2030) AI could automate software tasks that take humans three months.
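The machinery behind these projections is just compounding: pick a starting horizon and a doubling time, then roll it forward. A sketch, using the roughly 50-minute horizon and 7-month doubling time quoted above (the article’s figures, not new measurements):

```python
from datetime import date

# Compounding a time horizon forward under a fixed doubling rule.
def projected_horizon_minutes(start: date, target: date,
                              start_minutes: float = 50.0,   # quoted above
                              doubling_months: float = 7.0) -> float:
    months = (target.year - start.year) * 12 + (target.month - start.month)
    return start_minutes * 2 ** (months / doubling_months)

minutes = projected_horizon_minutes(date(2025, 3, 1), date(2030, 3, 1))
print(f"{minutes / 60:.0f} hours ≈ {minutes / 60 / 167:.1f} working months")
# Prints roughly 317 hours, about 1.9 working months of 167 hours each;
# the faster doubling observed in 2024 would push this toward the
# multi-month projections quoted above.
```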
Real-World Knowledge Work Is Messy
Real-world knowledge work often involves complex details that benchmarks typically exclude, such as being under-specified or poorly scoped, having unclear feedback loops or success criteria, or requiring coordination between multiple streams of work in real-time.
The study generally observed that agents struggle more with tasks containing these “messy” details. This raises the question of whether agents exhibited similar rates of improvement on “less messy” tasks compared to “more messy” ones.
For autonomously generating large economic value, the study extrapolates that within five years of its publication (March 2025), so by around 2029, AI systems could automate many software tasks that currently take humans a month.
Human Cost & AI Agent Cost
This scatter plot illustrates the relationship between task length and the cost ratio of using AI models versus human labor for 1,460 tasks.
The x-axis represents task length (the time humans take to complete tasks), ranging from 1 second to 1 day on a logarithmic scale.
The y-axis shows the “Model Cost / Human Cost” ratio, also on a logarithmic scale.
The graph suggests that AI is generally more cost-effective than humans for shorter tasks, but as task length increases, the cost advantage of AI diminishes, with costs becoming more comparable for tasks taking several hours or a full day.
This aligns with the study’s focus on AI’s growing capability to handle longer tasks, potentially impacting economic value as AI becomes viable for more complex, time-intensive work.
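For intuition, here is how such a per-task cost ratio is typically assembled; the hourly wage and token price below are hypothetical placeholders, not figures from the study.

```python
# How a per-task "model cost / human cost" ratio is assembled.
# The wage and token price are hypothetical placeholders, not study figures.
def cost_ratio(task_hours: float, tokens_used: int,
               hourly_wage: float = 70.0,
               usd_per_million_tokens: float = 10.0) -> float:
    human_cost = task_hours * hourly_wage
    model_cost = tokens_used / 1_000_000 * usd_per_million_tokens
    return model_cost / human_cost

# A 2-minute task burning 20k tokens vs. an 8-hour task where the agent
# burns 40M tokens across many steps and retries:
print(round(cost_ratio(2 / 60, 20_000), 3))   # 0.086: model far cheaper
print(round(cost_ratio(8, 40_000_000), 3))    # 0.714: advantage largely gone
```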
Cost & Accuracy
This scatter plot compares the performance of various AI Agents on a task, plotting accuracy against cost in USD as of April 2024.
The agents are categorised into complex agents (red circles), baseline agents (purple crosses), and zero-shot models (green squares), with labels indicating the underlying models (e.g., GPT-4, GPT-3.5) and techniques.
The Pareto frontier (dashed line) highlights the trade-off between accuracy and cost.
Trade-off Between Accuracy & Cost
The graph shows a clear trade-off along the Pareto frontier, where higher accuracy comes at a higher cost.
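Computing such a frontier is straightforward: keep an agent only if nothing else is simultaneously cheaper and more accurate. A sketch, with hypothetical agents loosely shaped like the plot described above:

```python
# Keep an agent on the frontier only if no other agent is both cheaper
# and more accurate. Agents and numbers below are hypothetical.
def pareto_frontier(points):
    """points: list of (name, cost_usd, accuracy). Returns frontier names."""
    frontier, best_accuracy = [], -1.0
    for name, cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:  # strictly beats everything cheaper
            frontier.append(name)
            best_accuracy = accuracy
    return frontier

agents = [("zero-shot GPT-3.5", 1.0, 0.48), ("zero-shot GPT-4", 4.0, 0.75),
          ("baseline agent", 9.0, 0.72), ("complex GPT-4 agent", 30.0, 0.92)]
print(pareto_frontier(agents))
# ['zero-shot GPT-3.5', 'zero-shot GPT-4', 'complex GPT-4 agent']
```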
Performance of Complex Agents
Complex agents, especially those built on GPT-4, consistently outperform baseline and zero-shot models in accuracy, though they are more expensive.
Conclusion
AI Agents face a critical challenge in balancing accuracy with practical deployment.
The graphs from these studies demonstrate that while complex AI Agents built on models like GPT-4 can achieve high accuracy, they come with significantly higher costs.
Additionally, the “50%-task-completion time horizon” shows AI doubling its capability every 7 months, projecting that by 2029, AI could automate month-long human tasks, potentially generating substantial economic value.
However, this progress also raises concerns about catastrophic risks, as AI approaches a 4-hour task horizon by 2027, where autonomy in complex tasks could lead to harmful outcomes if not carefully managed.
Ultimately, while AI agents are poised to revolutionise industries, their accuracy, cost, and safety must be meticulously addressed to ensure they deliver value without unintended consequences.
Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.