Agentic Discovery
Web-Navigating AI Agents: Redefining Online Interactions and Shaping the Future of Autonomous Exploration.
Introduction
What are AI agents or Agentic Applications? Well, this is the best definition I could come up with:
An AI Agent is a software program designed to perform tasks or make decisions autonomously based on the tools that are available.
As shown in the image below, agents rely on one or more Large Language Models or Foundation Models to break down complex tasks into manageable sub-tasks.
These sub-tasks are organised into a sequence of actions that the agent can execute.
The agent also has access to a set of defined tools, each with a description that helps it determine when and how to use these tools in sequence to address challenges and reach a final conclusion.
Understanding Agents
One thing I find frustrating is the misguided content and commentary surrounding AI Agents and agentic applications: what they are and what they aren't.
While it's easy to theorise and speculate about the future, such discussions are often ungrounded. The best approach is to base our understanding on recent research and to be familiar with the current technologies and frameworks available.
To truly grasp what AI Agents are, it's essential to build one yourself. The simplest way to start is by copying the code provided below and running it in a Colab notebook. As you execute each segment, your understanding will deepen. Then, try modifying details in the code, such as the OpenAI model used, and run it again to see the effects.
Below is the complete Python code for the AI agent. The only adjustments you'll need to make are adding your OpenAI API key and LangSmith project variables.
### Install Required Packages:
pip install -qU langchain langchain-openai langchain_community langchain_experimental
pip install -qU duckduckgo-search
### Import Required Modules and Set Environment Variables:
import os
from uuid import uuid4
### Setup the LangSmith environment variables
unique_id = uuid4().hex[0:8]
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"OpenAI_SM_1"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<LangSmith API Key Goes Here>"
### Import LangChain Components and OpenAI API Key
from langchain.chains import LLMMathChain
from langchain_community.utilities import DuckDuckGoSearchAPIWrapper
from langchain_core.tools import Tool
from langchain_experimental.plan_and_execute import (
PlanAndExecute,
load_agent_executor,
load_chat_planner,
)
from langchain_openai import ChatOpenAI, OpenAI
### Set the OpenAI API Key
os.environ["OPENAI_API_KEY"] = "<OpenAI API Key>"
### Set Up Search and Math Chain Tools
search = DuckDuckGoSearchAPIWrapper()
llm = OpenAI(temperature=0)
llm_math_chain = LLMMathChain.from_llm(llm=llm, verbose=True)
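# LLMMathChain translates natural-language math questions into expressions and evaluates them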
tools = [
Tool(
name="Search",
func=search.run,
description="useful for when you need to answer questions about current events",
),
Tool(
name="Calculator",
func=llm_math_chain.run,
description="useful for when you need to answer questions about math",
),
]
### Initialize Planner and Executor
model = ChatOpenAI(model_name='gpt-4o-mini', temperature=0)
planner = load_chat_planner(model)
executor = load_agent_executor(model, tools, verbose=True)
agent = PlanAndExecute(planner=planner, executor=executor)
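# PlanAndExecute first asks the planner to draft a multi-step plan,
# then the executor carries out each step in turn using the tools above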
### Invoke the Agent
agent.invoke(
"Who is the founder of SpaceX an what is the square root of his year of birth?"
)
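For reference, invoke returns a dictionary rather than a bare string. A minimal way to capture just the final answer (assuming the default "output" key LangChain chains use) looks like this:
result = agent.invoke(
    "Who is the founder of SpaceX and what is the square root of his year of birth?"
)
print(result["output"])  # the final answer after planning and execution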
Web-Navigating: The Next Frontier
As agents grow in capability, they are also expanding into web navigation by leveraging the visual capabilities of language models.
Firstly, language models with vision capabilities significantly enhance AI agents by incorporating an additional modality, enabling them to process and understand visual information alongside text.
I've often considered what the most effective use cases for multi-modal models might be, and agentic applications that require visual input are a prime example.
Secondly, recent developments such as Apple's Ferret-UI, AppAgent v2 and the WebVoyager/LangChain implementation showcase how GUI elements can be mapped and defined using named bounding boxes, further advancing the integration of vision in agent-driven tasks.
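To make the bounding-box idea concrete, here is a minimal sketch of "set-of-marks" style annotation: drawing numbered boxes over GUI elements on a screenshot so a vision-capable model can refer to each element by label. This is not the Ferret-UI or WebVoyager code; the Pillow helper and the element coordinates are illustrative assumptions.
from PIL import Image, ImageDraw

def annotate_elements(screenshot_path, elements):
    # Draw a numbered bounding box over each interactive element
    image = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for idx, element in enumerate(elements):
        x1, y1, x2, y2 = element["bbox"]
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(idx), fill="red")
    return image

# Hypothetical elements, as a DOM or accessibility-tree crawl might return them
elements = [
    {"bbox": (10, 10, 120, 40), "role": "button", "name": "Search"},
    {"bbox": (10, 60, 300, 90), "role": "textbox", "name": "Query"},
]
annotate_elements("screenshot.png", elements).save("annotated.png")
The annotated screenshot, together with the element list, can then be passed to the vision model, which answers with an element number and an action to perform on it.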
WebPilot
The code will be publicly available at github.com/WebPilot.
In general, the initial goal was to enable agents to break down complex and ambiguous questions into smaller, manageable steps that can be solved sequentially, much like humans do.
This was followed by the development of independent tools that can be integrated to enhance the agent's capabilities. Each agent is identified by a description that outlines its specific abilities and functionalities.
WebPilot aims to extend the capabilities of agents by enabling them to explore the web via a web browser.
Currently, agents are expanding in two key areas. The first is the web, where agents explore pages through a browser and interpret their content.
The second is mobile operating systems, where agents are being developed to operate effectively.
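As a rough illustration of the first area, the snippet below sketches the observe-act primitive any browser-driving agent needs. It assumes Playwright as the browser layer, since WebPilot's own browser code has not been released:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    observation = page.content()        # raw HTML for the agent to interpret
    page.screenshot(path="step.png")    # screenshot for a vision-capable model
    # an agent would now choose its next action, e.g. page.click(...) or page.fill(...)
    browser.close()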
The image above illustrates how WebPilot takes the steps produced by its decomposition process and explores the web for answers.
To fully harness this potential, these agents must excel in tasks such as complex information retrieval, long-horizon task execution, and the integration of diverse information sources. (source: the WebPilot paper)
Planner, Controller, Extractor
Specifically, the Global Optimisation phase is driven by the Planner, Controller, Extractor and Verifier.
The Planner simplifies complex tasks by breaking them into smaller, manageable steps, helping to focus on specific actions and tackle the challenges of traditional MCTS (Monte Carlo Tree Search).
Reflective Task Adjustment (RTA) then fine-tunes the plan using new observations, enabling WebPilot to adapt as needed.
The Controller monitors subtask progress, evaluating completion and generating reflections if re-execution is needed, ensuring accurate and adaptive task completion.
Throughout this process, the Extractor collects essential information to aid in task execution. This coordinated approach ensures that WebPilot remains adaptable and efficient in dynamic environments.
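Since the code is not yet public, the sketch below is purely speculative: it shows one way the described roles could fit together in a loop. Every name and the control flow here are my own assumptions based on the paper's description, not WebPilot's actual implementation.
from dataclasses import dataclass, field

@dataclass
class WebPilotSketch:
    planner: callable     # task -> ordered list of sub-tasks
    executor: callable    # sub-task -> observation from the browser
    controller: callable  # (sub-task, observation) -> (done, reflection)
    extractor: callable   # observation -> information worth keeping
    memory: list = field(default_factory=list)

    def run(self, task, max_retries=3):
        for subtask in self.planner(task):  # the Planner decomposes the task
            for _ in range(max_retries):
                observation = self.executor(subtask)
                self.memory.append(self.extractor(observation))  # the Extractor gathers evidence
                done, reflection = self.controller(subtask, observation)  # the Controller judges progress
                if done:
                    break
                # Reflective Task Adjustment: retry the sub-task with the reflection folded in
                subtask = f"{subtask}\n(Reflection: {reflection})"
        return self.memory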
Conclusion
Although this study is at the forefront of agentic applications, the framework itself feels somewhat opaque to me, and I don't fully grasp all the concepts.
Once the code is released and working prototypes can be built, the approach and framework should become clearer.
I'm currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language, ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.