AI Agents That Matter
Measuring AI Agents in terms of accuracy and cost. AI Agent research and development need to incorporate both accuracy and inference-cost effectiveness, which will open up a new perspective on agent design…
Introduction
Reading this study was like a breath of fresh air. The hype around AI Agents over the last six months has been immense, even though practical, working ReAct (Reason & Act) AI Agents have been around for almost two years.
However, while companies try to showcase their prowess and technical advancement, AI Agents have become needlessly complex and costly, and the community has reached mistaken conclusions about the sources of accuracy gains.
At present, Claude is state-of-the-art for models that use computers in the same way as a person does — that is, from looking at the screen and taking actions in response.
On one evaluation created to test developers’ attempts to have models use computers, OSWorld, Claude currently gets 14.9%. That’s nowhere near human-level skill (which is generally 70–75%), but it’s far higher than the 7.7% obtained by the next-best AI model in the same category. — Source
This study endeavours to add another metric to AI Agent evaluation: cost.
Cost Considerations
One consideration is that AI Agents call the underlying Language Models multiple times, increasing the cost considerably.
The approach of neglecting real-world implementation cost can encourage researchers to develop extremely costly agents just to claim they topped the leaderboard.
In the last year, many agents have been claimed to achieve state-of-the-art accuracy on coding tasks. But at what cost?
But what would a scale that compares accuracy and cost look like? How does an increase in cost trade off against accuracy, and vice versa?
There are significant cost differences between agents. Agents with similar accuracy can vary in cost by almost two orders of magnitude.
Despite this, cost metrics are rarely highlighted in existing research.
Study Baseline
The study evaluated the cost and time requirements of running AI Agents, comparing several baseline approaches in terms of accuracy, cost, and runtime.
The baseline methods included:
Direct Use of GPT-3.5 & GPT-4 without any agent architecture, in a zero-shot setup.
Retry Strategy, where the model is invoked up to five times with a temperature of zero, allowing retries due to the non-deterministic behaviour of LLMs, even at zero temperature.
Warming Strategy, similar to the retry method, but gradually increasing the temperature from 0 to 0.5 across retries to introduce greater variability, aiming to improve success rates.
Escalation Strategy, beginning with a less expensive model (e.g., Llama-3 8B) and escalating to more advanced and costly models (e.g., GPT-3.5, Llama-3 70B, GPT-4) upon encountering failures in test cases.
These methods allowed for a comprehensive analysis of resource efficiency and effectiveness in achieving task goals; a sketch of the retry, warming and escalation strategies is shown below.
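To make the baselines concrete, here is a minimal Python sketch of the retry, warming and escalation strategies. The `call_model` and `passes_tests` helpers, the model names and the attempt counts are assumptions for illustration, not code from the study.

```python
def call_model(model: str, prompt: str, temperature: float) -> str:
    """Hypothetical wrapper around a single LLM call (placeholder)."""
    raise NotImplementedError

def passes_tests(candidate: str) -> bool:
    """Hypothetical check that runs the task's test cases against a candidate."""
    raise NotImplementedError

def retry_strategy(prompt: str, attempts: int = 5) -> str | None:
    # Invoke the same model up to five times at temperature 0,
    # relying on the residual non-determinism of LLM sampling.
    for _ in range(attempts):
        candidate = call_model("gpt-4", prompt, temperature=0.0)
        if passes_tests(candidate):
            return candidate
    return None

def warming_strategy(prompt: str, attempts: int = 5) -> str | None:
    # Like retry, but ramp the temperature from 0 to 0.5 across attempts
    # to deliberately introduce variability.
    for i in range(attempts):
        temperature = 0.5 * i / (attempts - 1)
        candidate = call_model("gpt-4", prompt, temperature=temperature)
        if passes_tests(candidate):
            return candidate
    return None

def escalation_strategy(prompt: str) -> str | None:
    # Start with a cheap model and escalate to costlier ones on failure.
    for model in ("llama-3-8b", "gpt-3.5-turbo", "llama-3-70b", "gpt-4"):
        candidate = call_model(model, prompt, temperature=0.0)
        if passes_tests(candidate):
            return candidate
    return None
```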
The Pareto Frontier
The Pareto frontier represents the set of optimal solutions where no objective can be improved without worsening at least one other objective.
These solutions are considered “Pareto efficient.”
In simpler terms, imagine evaluating AI Agents based on two criteria, such as accuracy and cost.
A solution lies on the Pareto frontier if there isn’t another solution that is better in both accuracy and cost.
If one solution is more accurate but also more expensive, it might still be Pareto efficient because no alternative performs better on both dimensions simultaneously.
The Pareto frontier is often visualised as a curve or boundary in a two-dimensional graph, separating feasible solutions into optimal (on the frontier) and dominated (off the frontier) groups. This helps in making trade-offs between competing objectives.
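As a concrete illustration, the sketch below filters a set of agent results down to its Pareto frontier on accuracy and cost. The agent names and numbers are invented purely for the example.

```python
def pareto_frontier(results: list[dict]) -> list[dict]:
    """Keep only results that no other result dominates, i.e. no alternative
    has accuracy >= and cost <= with at least one strict improvement."""
    frontier = []
    for a in results:
        dominated = any(
            b["accuracy"] >= a["accuracy"] and b["cost"] <= a["cost"]
            and (b["accuracy"] > a["accuracy"] or b["cost"] < a["cost"])
            for b in results
        )
        if not dominated:
            frontier.append(a)
    return frontier

# Invented example: similar accuracy, very different cost (dollars per task).
agents = [
    {"name": "zero_shot",     "accuracy": 0.35, "cost": 0.10},
    {"name": "simple_retry",  "accuracy": 0.41, "cost": 0.50},
    {"name": "complex_agent", "accuracy": 0.40, "cost": 40.00},
]
print(pareto_frontier(agents))
# zero_shot and simple_retry survive; complex_agent is dominated because
# simple_retry is both more accurate and roughly 80x cheaper.
```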
There are instances where solutions are over-engineered and seemingly technically astute, yet ultimately not effective.
What is a System 2 Approach?
A System 2 approach refers to deliberate, reflective and logical thinking processes often contrasted with the intuitive, fast and automatic nature of System 1 thinking.
In AI, System 2 methods involve complex reasoning strategies, such as planning, reflection and debugging, requiring more computation and attention than simpler, reactive approaches.
Simplified Rewrite and Analysis
Evidence for System 2 approaches (like reflection and debugging) driving performance improvements in AI Agents is lacking.
Many studies fail to compare these complex methods with simple baselines, leading to misconceptions about their contributions to accuracy gains.
The findings from the study suggest that the role of these methods in tasks like code generation is still unclear, especially for simpler benchmarks.
Overconfidence in these approaches is further fuelled by reproducibility issues and inconsistent evaluations.
System 2 techniques might prove valuable for more challenging tasks, but this remains speculative. Evaluations should prioritise controlling for cost, as improvements in accuracy can often stem from trivial strategies like retries rather than meaningful advancements.
Without considering cost alongside accuracy, identifying true innovation in agent design is impossible.
Visualising Cost
From a business perspective, there needs to be a strong business use-case when implementing AI Agents, and this should revolve around proven cost savings or increased revenue. Hence the operational cost of AI Agents is important.
Visualising the cost and accuracy of AI Agents as a Pareto frontier reveals opportunities to design agents that optimise both, leading to lower costs without sacrificing performance. This approach can also extend to other factors like latency.
The total cost of running an AI Agent consists of fixed and variable costs.
Fixed costs arise from one-time efforts like tuning parameters (e.g., temperature or prompt design) for a task, while variable costs depend on usage and scale with the number of input and output tokens.
Over time, variable costs tend to dominate as agents are used more frequently.
By focusing on joint optimisation, it is possible to balance fixed and variable costs. Investing upfront in refining agent design can lower ongoing operational costs, such as by creating shorter prompts or more efficient few-shot examples that maintain accuracy.
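A back-of-the-envelope cost model can make this trade-off explicit. The per-token prices, token counts and the $500 tuning effort below are illustrative assumptions, not figures from the study.

```python
def total_cost(fixed_cost: float, runs: int,
               input_tokens: int, output_tokens: int,
               price_in_per_1k: float, price_out_per_1k: float) -> float:
    """One-time design/tuning cost plus per-run token costs that scale with usage."""
    variable_per_run = (input_tokens / 1000) * price_in_per_1k \
                     + (output_tokens / 1000) * price_out_per_1k
    return fixed_cost + runs * variable_per_run

# Illustrative: a $500 one-time prompt-engineering effort that halves the
# input prompt from 4,000 to 2,000 tokens per run, at assumed token prices.
baseline = total_cost(0,   100_000, 4_000, 500, price_in_per_1k=0.01, price_out_per_1k=0.03)
tuned    = total_cost(500, 100_000, 2_000, 500, price_in_per_1k=0.01, price_out_per_1k=0.03)
print(f"baseline: ${baseline:,.0f}, tuned: ${tuned:,.0f}")
# baseline: $5,500, tuned: $4,000 -- the upfront fixed cost pays for itself at scale.
```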
In Closing
Even though a basic AI Agent technical architecture is settling, benchmarking AI Agents seems to still be in its early stages, and best practices have yet to be established, making it difficult to separate genuine progress from exaggerated claims.
AI Agents add significant complexity to model-based implementations, requiring a fresh approach to evaluation.
To address this, initial recommendations include cost-aware comparisons, separating model evaluation from downstream task performance, using proper hold-outs to prevent shortcuts, and standardising evaluation methods.
These steps aim to improve the rigour of agent benchmarking and create a solid foundation for future advancements.