An AI Agent Architecture & Framework Is Emerging

We are beginning to see convergence on fundamental architectural principles that are poised to define the next generation of AI agents…

These architectures are far more than just advanced models; definitive building blocks are emerging that will enable AI Agents & Agentic Applications to act autonomously, adapt dynamically, and interact and explore seamlessly within digital environments.

And as AI Agents become more capable, builders are converging on common principles and approaches for their core components.

I want to add a caveat: while there’s plenty of futuristic speculation around AI Agents, Agentic Discovery, and Agentic Applications, the insights and comments I share here are grounded in concrete research papers and hands-on experience with prototypes that I’ve either built or forked and tested in my own environment.

But First, Let’s Set The Stage With Some Key Concepts…

What Are AI Agents?

At a high level, an AI Agent is a system designed to perform tasks autonomously or semi-autonomously. Considering the semi-autonomous case for a moment: agents make use of tools to achieve their objective, and a human-in-the-loop can itself be one of those tools.

AI Agent tasks can range from a virtual assistant that schedules your appointments to more complex agents that explore and interact with digital environments. With regard to digital environments, the most prominent research includes Apple's Ferret-UI, WebVoyager, and work from Microsoft and others.

An AI Agent is a program that uses one or more Large Language Models (LLMs) or Foundation Models (FMs) as its backbone, enabling it to operate autonomously.

By decomposing queries, then planning and creating a sequence of steps, the AI Agent effectively addresses and solves complex problems.

AI Agents can handle highly ambiguous questions by decomposing them through a chain of thought process, similar to human reasoning.

These agents have access to a variety of tools, including programs, APIs, web searches, and more, to perform tasks and find solutions.
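To make this concrete, here is a minimal sketch of such an agent loop in Python. The call_llm backbone, the tool registry, and the message format are assumptions made purely for illustration; the point is the cycle of decomposing a query, selecting a tool, executing it, and feeding the result back to the model.

```python
import json

# Placeholder for the LLM/FM backbone (an assumption, not a real API):
# it inspects the conversation and either requests a tool or returns a final answer.
def call_llm(messages: list[dict]) -> dict:
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool", "tool": "calculator",
                "arguments": json.dumps({"expression": "21 * 2"})}
    return {"type": "final", "content": f"The answer is {messages[-1]['content']}."}

# A small tool registry: programs, APIs, web search, and so on.
TOOLS = {
    "calculator": lambda expression: str(eval(expression)),       # demo only
    "web_search": lambda query: f"(stub) results for: {query}",   # placeholder
}

def run_agent(user_query: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        step = call_llm(messages)            # plan the next step
        if step["type"] == "final":          # done: return the answer
            return step["content"]
        result = TOOLS[step["tool"]](**json.loads(step["arguments"]))
        messages.append({"role": "tool", "content": result})
    return "Stopped: step budget exhausted."

print(run_agent("What is 21 * 2?"))  # -> "The answer is 42."
```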

Large Action Models (LAMs)

Much as Large Language Models (LLMs) transformed natural language processing, Large Action Models (LAMs) are poised to revolutionise the way AI agents interact with their environments.

In a recent piece I wrote, I explored the emergence of Large Action Models (LAMs) and their future impact on AI Agents.

Salesforce AI Research open-sourced a number of LAMs, including a Small Action Model.

LAMs are designed to go beyond simple language generation by enabling AI to take meaningful actions in real-world scenarios.

Function calling has become a crucial element in the context of AI Agents, particularly from a model capability standpoint, because it significantly extends the functionality of large language models (LLMs) beyond text generation.

Hence one of the reasons for the advent of Large Action Models, one of whose defining traits is the ability to excel at function calling.

AI Agents often need to perform actions based on user input, such as retrieving information, scheduling tasks, or performing computations.

Function calling allows the model to generate parameters for these tasks, enabling the agent to trigger external processes like database queries or API calls.
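As an illustration, below is a minimal sketch of how such a function can be described to a model using the OpenAI Python SDK. The book_meeting function, its parameter schema, and the model name are assumptions for the example; the key point is that the model receives a JSON-schema description of the action and returns structured arguments for it.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A JSON-schema description of a hypothetical action the agent can take.
tools = [{
    "type": "function",
    "function": {
        "name": "book_meeting",
        "description": "Book a meeting in the user's calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "start_time": {"type": "string", "description": "ISO 8601 datetime"},
                "duration_minutes": {"type": "integer"},
            },
            "required": ["title", "start_time"],
        },
    },
}]

# The model does not run book_meeting; it only proposes the call and its arguments.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # any function-calling capable model
    messages=[{"role": "user", "content": "Book a 30 minute sync tomorrow at 10:00."}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```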

Model Orchestration & Leveraging Small Language Models

While LAMs form the action backbone, model orchestration brings together smaller, more specialised language models (SLMs) to assist in niche tasks.

Instead of relying solely on massive, resource-heavy models, agents can utilise these smaller models in tandem, orchestrating them for specific functions — whether that’s summarising data, parsing user commands, or providing insights based on historical context.
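A minimal sketch of this kind of orchestration is shown below. The routing table, the model identifiers, and the generate placeholder are purely illustrative assumptions; the idea is simply that each niche task is dispatched to a small, specialised model, with a larger model as fallback.

```python
# Illustrative routing table: each niche task is handled by a small, specialised model.
ROUTES = {
    "summarise": "slm-summariser",
    "parse_command": "slm-command-parser",
    "insights": "slm-analytics",
}

def generate(model_name: str, prompt: str) -> str:
    # Placeholder for a local SLM inference call (e.g. an on-device runtime).
    return f"[{model_name}] response to: {prompt[:40]}..."

def orchestrate(task_type: str, payload: str,
                fallback_model: str = "large-generalist") -> str:
    """Route a task to a specialised SLM when one exists, else fall back to a large model."""
    model = ROUTES.get(task_type, fallback_model)
    return generate(model, payload)

print(orchestrate("summarise", "Quarterly report text ..."))
print(orchestrate("translate", "Bonjour tout le monde"))  # no route -> falls back
```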

Small Language Models are ideal for development and testing, since they can be run locally in offline mode.

Large Language Models (LLMs) have rapidly gained traction due to several key characteristics that align well with the demands of natural language processing. These characteristics include natural language generation, common-sense reasoning, dialogue and conversation context management, natural language understanding, and the ability to handle unstructured input data. While LLMs are knowledge-intensive and have proven to be powerful tools, they are not without their limitations.

One significant drawback of LLMs is their tendency to hallucinate, meaning they can generate responses that are coherent and contextually plausible, yet factually incorrect.

Additionally, LLMs are constrained by the scope of their training data, which has a fixed cut-off date. This means they do not possess ongoing, up-to-date knowledge or specific insights tailored to particular industries, organizations, or companies.

Updating an LLM to address these gaps is not straightforward; it requires fine-tuning the base model, which involves considerable effort in data preparation, costs, and testing. This process introduces a non-transparent, complex approach to data integration within LLMs.

To address these shortcomings, the concept of Retrieval-Augmented Generation (RAG) has been introduced.

RAG helps bridge the gap for Small Language Models (SLMs), supplementing them with the deep, intensive knowledge capabilities they typically lack.

While SLMs inherently manage other key aspects such as language generation and understanding, RAG equips them to perform comparably to their larger counterparts by enhancing their knowledge base.

This makes RAG a critical equalizer in the realm of AI language models, allowing smaller models to function with the robustness of a full-scale LLM.
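Below is a minimal sketch of the RAG pattern, stripped down to word-overlap retrieval so it stays self-contained; in practice you would use embeddings and a vector store, and the documents and prompt wording here are assumptions for the example.

```python
# A toy document store and retriever: score documents by word overlap with the question.
DOCUMENTS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 09:00 to 17:00, Monday to Friday.",
    "The premium plan includes priority support and a 99.9% SLA.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    q_words = set(question.lower().split())
    scored = sorted(DOCUMENTS, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def rag_prompt(question: str) -> str:
    """Build a grounded prompt for a Small Language Model (the generation call is omitted)."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(rag_prompt("What is the refund policy?"))
```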

Vision-Enabled Language Models For Digital Exploration

As AI Agents gain capabilities to explore and interact with digital environments, the integration of vision capabilities with language models becomes crucial.

Projects like Ferret-UI from Apple and WebVoyager are excellent examples of this.

These agents can navigate within their digital surroundings, whether that means identifying elements on a user interface or exploring websites autonomously.

Imagine an AI Agent tasked with setting up an application in a new environment: it would not only read text-based instructions but also recognise UI elements via OCR, map their bounding boxes, and interpret on-screen text in order to interact with them and provide visual feedback.
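As a small illustration of the OCR and bounding-box step, the sketch below uses pytesseract to locate a text element in a screenshot; the screenshot path, the target label, and the idea of clicking the centre of the matched element are assumptions for the example.

```python
from PIL import Image
import pytesseract

def find_ui_element(screenshot_path: str, label: str) -> tuple[int, int] | None:
    """Return the centre (x, y) of the first OCR'd word matching `label`, or None."""
    data = pytesseract.image_to_data(Image.open(screenshot_path),
                                     output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word.strip().lower() == label.lower():
            x, y = data["left"][i], data["top"][i]
            w, h = data["width"][i], data["height"][i]
            return (x + w // 2, y + h // 2)   # a point the agent could click
    return None

# Hypothetical usage: locate the "Install" button on a captured screen.
# print(find_ui_element("screen.png", "Install"))
```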

Function Calling & Structured Output

A fundamental shift is happening in how AI agents handle inputs and outputs.

Traditionally, LLMs have operated on unstructured input and generated unstructured output: anything from a short phrase to paragraphs of text. But now, with function calling, we are moving toward structured, actionable outputs.

While LLMs are great for understanding and producing unstructured content, LAMs are designed to bridge the gap by turning language into structured, executable actions.

When an AI Agent can structure its output to align with specific functions, it can interact with other systems far more effectively.

For instance, instead of generating a purely unstructured, conversational text response, the AI can call a specific function to book a meeting, send a request, or trigger an API call, all with more efficient token usage.

Not only does this reduce the overhead of processing unstructured responses, but it also makes interactions between systems more seamless.

Something to realise in terms of Function Calling is that, when using the OpenAI API with function calling, the model does not execute functions directly; it only returns the name of the function to call and the arguments to pass, and the calling application is responsible for executing the function and handling the result.
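A minimal sketch of that handoff is shown below, with a hypothetical book_meeting implementation standing in for a real calendar API.

```python
import json

# Hypothetical local implementation of the action the model requested.
def book_meeting(title: str, start_time: str, duration_minutes: int = 30) -> str:
    return f"Booked '{title}' at {start_time} for {duration_minutes} minutes."

LOCAL_FUNCTIONS = {"book_meeting": book_meeting}

def execute_tool_call(name: str, arguments_json: str) -> str:
    """The application, not the model, runs the function the model asked for."""
    args = json.loads(arguments_json)          # model output: a JSON string of arguments
    return LOCAL_FUNCTIONS[name](**args)       # dispatch to real code (API, DB query, etc.)

# Example: what the model might have returned as a tool call.
print(execute_tool_call("book_meeting",
                        '{"title": "Team sync", "start_time": "2025-01-15T10:00"}'))
```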

AI Agents can now become truly part of the larger digital ecosystem.

The Role of Tools: Pipelines & Human-in-the-Loop

Finally, let’s talk about the importance of tools in the architecture of AI agents.

Tools can be thought of as the mechanisms through which AI Agents interact with the world — whether that’s fetching data, performing calculations, or executing tasks. In many ways, these tools are like pipelines, carrying inputs from one stage to another, transforming them along the way.

What’s even more fascinating is that a tool doesn’t necessarily have to be an algorithm or script. In some cases, the tool can be a human-in-the-loop, where humans intervene at key moments to guide or validate the agent’s actions.

This is particularly valuable in high-stakes environments, such as healthcare or finance, where absolute accuracy is critical.

Tools not only extend the capabilities of AI agents but also serve as the glue that holds various systems together. Whether it’s a human or a digital function, these tools allow AI agents to become more powerful, modular, and context-aware.
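To show how a human can quite literally be registered as a tool, here is a minimal sketch; the approval threshold, the payment action, and the input() prompt are illustrative assumptions.

```python
def human_approval(action_description: str) -> bool:
    """A human-in-the-loop 'tool': the agent pauses and asks a person to validate an action."""
    answer = input(f"Agent wants to: {action_description}. Approve? [y/n] ")
    return answer.strip().lower() == "y"

def execute_payment(amount: float, payee: str) -> str:
    # Hypothetical high-stakes action (e.g. finance), gated behind human review above a threshold.
    if amount > 1000 and not human_approval(f"pay {payee} ${amount:.2f}"):
        return "Action rejected by human reviewer."
    return f"Paid {payee} ${amount:.2f}."

# print(execute_payment(2500.00, "Acme Corp"))
```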

The Future of AI Agents

As we stand at the cusp of this new era, it’s clear that AI agents are becoming far more sophisticated than we ever anticipated.

With Large Action Models, Model Orchestration, vision-enabled language models, Function Calling, and the critical role of Tools, these agents are active participants in solving problems, exploring digital landscapes, and learning autonomously.

By focusing on these core building blocks, we're setting the foundation for AI agents that are not just smarter, but more adaptable, efficient, and capable of acting in ways that start to resemble human problem solving and thought processes.

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.
