
AI Agents & Optimising Cost & Performance

5 min read · Oct 8, 2025


In short

A number of studies have applied the principle of a Pareto frontier to AI Agents and the trade-off between accuracy and cost.

So it came down to choosing the optimal balance between cost and accuracy when selecting a model.

This was paired with the idea that an AI Agent makes use of only one language model, which serves as the backbone of the agent for Natural Language Generation (NLG), reasoning, context management, and so on.

Then OpenAI and NVIDIA introduced the notion of orchestrating multiple smaller models for specific tasks. For instance, within their AI Agent framework, NVIDIA fine-tuned a small language model (SLM) for the sole purpose of increasing the accuracy of tool selection.

OpenAI sequences smaller language models in its Deep Research API and in ChatGPT.
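As a side note, the orchestration pattern itself can be sketched in a few lines: a small model dedicated to tool selection, a larger one to generation. This is a minimal illustration of the idea, not NVIDIA's or OpenAI's actual framework; the callables and the tool registry below are assumptions.

    # Hedged sketch of the orchestration pattern (illustrative, not vendor code):
    # a small fine-tuned model handles only the narrow tool-selection step, while
    # a larger model composes the final answer. Model calls are passed in as plain
    # callables, so no specific SDK is assumed.
    from typing import Callable, Dict

    def agent_turn(user_input: str,
                   select_tool: Callable[[str], str],      # small language model (SLM)
                   generate: Callable[[str], str],         # larger backbone model
                   tools: Dict[str, Callable[[str], str]]) -> str:
        tool_name = select_tool(user_input)                # cheap, specialised call
        tool_output = tools.get(tool_name, lambda _: "")(user_input)
        return generate(
            f"User request: {user_input}\nTool output: {tool_output}\nAnswer:"
        )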

All of these approaches are rather static.

The research, called Avengers-Pro, introduces a performance-efficiency orchestrator which selects a language model based on the user input at each and every dialog turn.

Avengers-Pro works like a smart traffic cop for AI queries.

It starts by embedding incoming prompts into semantic vectors using a lightweight model (Qwen3-embedding-8B), then clusters them into 60 semantically coherent groups based on a labeled dataset of query-answer pairs.

For each cluster, it computes a performance-efficiency score for every model.

This ensures that each interaction is optimised for both accuracy and cost.
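In code, the routing step can be sketched roughly as follows. This is my own minimal reconstruction, not the authors' implementation: the embed callable, the 60 cluster centroids and the per-cluster score table are assumed to be prepared offline from the labelled query-answer data.

    # Minimal sketch of the per-query routing step (not the authors' code).
    # Assumptions: embed() wraps an embedding model such as Qwen3-embedding-8B,
    # centroids holds the 60 unit-normalised cluster centres learned offline,
    # and scores[cluster][model] is a precomputed performance-efficiency score.
    import numpy as np

    def route_query(prompt, embed, centroids, scores):
        """Return the name of the model that should handle this prompt."""
        q = embed(prompt)                          # semantic vector for the prompt
        q = q / np.linalg.norm(q)                  # normalise for cosine similarity
        sims = centroids @ q                       # similarity to each cluster centre
        cluster = int(np.argmax(sims))             # nearest of the 60 clusters
        # pick the model with the best performance-efficiency score for this cluster
        return max(scores[cluster], key=scores[cluster].get)

    # Illustrative usage:
    # model_name = route_query(user_prompt, embed, centroids, scores)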


Taking a Step Back

Last year, when everyone was talking about AI Agents, I published an article emphasising the importance of evaluating AI Agents not only on accuracy but also on cost efficiency (an analysis based on the study AI Agents That Matter).

I tried to highlight that many AI Agent implementations had become overly complex and expensive, often leading to flawed assessments of their effectiveness.

What I liked about this study was how it plotted model behaviour in the context of AI Agents on a Pareto frontier: a visualisation that balances accuracy against cost to identify designs that achieve optimal trade-offs.

But the problem is that a single model was linked to an AI Agent… what if you want to select a model which is optimised for each user input?


Now, a new paper addresses these concerns directly…

The study introduces a test-time routing framework (named Avengers-Pro) that integrates eight prominent language models, ranging from cost-effective options like Qwen3 variants to advanced models such as GPT-5-medium and Claude-4.1-opus.


The system dynamically routes queries to the most appropriate model, aligning with the need for cost-aware assessments while advancing toward deployable solutions.

It dynamically routes each incoming query to exactly one model from its ensemble (e.g., eight LLMs like Qwen3 variants up to GPT-5-medium) that’s optimised for the best trade-off between performance (accuracy) and cost (token-based expenses).

Avengers-Pro operates by first converting input prompts into semantic embeddings using a compact model (Qwen3-embedding-8B).

It then groups these embeddings into 60 clusters derived from a dataset of query-answer pairs.

For each cluster, the framework calculates a performance-efficiency score for each model, combining normalised accuracy on similar tasks with normalised token-based costs from APIs like OpenRouter.
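The paper's exact formula is not reproduced here, but on my reading the score can be sketched as a weighted combination of normalised accuracy and normalised cost, with a trade-off weight (alpha) that tilts routing toward performance or toward economy.

    # Hedged sketch of a performance-efficiency score for one (cluster, model)
    # pair; my reading of the description, not the paper's exact formula.
    # Accuracy and cost are min-max normalised across the ensemble, and alpha
    # balances the two terms (higher alpha favours accuracy over cheapness).
    def performance_efficiency(accuracy, cost, acc_range, cost_range, alpha=0.5):
        acc_min, acc_max = acc_range
        cost_min, cost_max = cost_range
        acc_norm = (accuracy - acc_min) / (acc_max - acc_min + 1e-9)
        cost_norm = (cost - cost_min) / (cost_max - cost_min + 1e-9)
        return alpha * acc_norm + (1 - alpha) * (1 - cost_norm)

    # With alpha tilted toward efficiency, a cheap model can outrank a pricier one:
    print(performance_efficiency(0.78, 0.4, (0.60, 0.90), (0.1, 9.0), alpha=0.4))  # ~0.82
    print(performance_efficiency(0.88, 6.5, (0.60, 0.90), (0.1, 9.0), alpha=0.4))  # ~0.54

Tuning alpha is effectively sliding along the Pareto frontier described earlier.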

The framework was evaluated across six challenging benchmarks, covering:

  • Reasoning
  • Knowledge and health
  • Coding
  • Agent tasks

This study extends the principles outlined in my earlier article.

At that time, I noted that agent costs — from multiple LLM invocations to prompt adjustments — were often ignored, with systems of similar accuracy differing in expense.

Avengers-Pro incorporates cost as a core factor in routing, prioritising economical models like Gemini-2.5-flash for straightforward tasks and reserving premium options like GPT-5 for complex ones.

For AI agent applications, the benefits are substantial: cost becomes a first-class routing signal rather than an afterthought.

Lastly…

When it comes to AI Agents and Agentic AI, I think the assumptions around “assumed knowledge” are misaligned.

Add to this the fact that AI Agents have not, in most cases, faced the rigours of production implementations.

If you’re still reading, thank you for your time and interest — I truly hope this content has earned it.


Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. Language Models, AI Agents, Agentic Apps, Dev Frameworks & Data-Driven Tools shaping tomorrow.
