Agentic Discovery
Web-Navigating AI Agents: Redefining Online Interactions and Shaping the Future of Autonomous Exploration.
Introduction
What are AI agents or Agentic Applications? Well, this is the best definition I could come up with:
An AI Agent is a software program designed to perform tasks or make decisions autonomously based on the tools that are available.
As shown in the image below, agents rely on one or more Large Language Models or Foundation Models to break down complex tasks into manageable sub-tasks.
These sub-tasks are organised into a sequence of actions that the agent can execute.
The agent also has access to a set of defined tools, each with a description that helps it determine when and how to use these tools in sequence to address challenges and reach a final conclusion.
Understanding Agents
One thing I find frustrating is the misguided content and commentary surrounding AI Agents and agentic applications: what they are and what they aren't.
While it's easy to theorise and speculate about the future, such discussions are often ungrounded. The best approach is to base our understanding on recent research and to be familiar with the current technologies and frameworks available.
To truly grasp what AI Agents are, it's essential to build one yourself. The simplest way to start is by copying the code provided below and running it in a Colab notebook. As you execute each segment, your understanding will deepen. Then, try modifying details in the code, such as the OpenAI model used, and run it again to see the effects.
Below is the complete Python code for the AI agent. The only adjustments you'll need to make are adding your OpenAI API key and LangSmith project variables.
### Install Required Packages:
pip install -qU langchain langchain-openai langchain_community langchain_experimental
pip install -qU duckduckgo-search
### Import Required Modules and Set Environment Variables:
import os
from uuid import uuid4
### Setup the LangSmith environment variables
unique_id = uuid4().hex[0:8]
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"OpenAI_SM_1"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<LangSmith API Key Goes Here>"
### Import LangChain Components and OpenAI API Key
from langchain.chains import LLMMathChain
from langchain_community.utilities import DuckDuckGoSearchAPIWrapper
from langchain_core.tools import Tool
from langchain_experimental.plan_and_execute import (
PlanAndExecute,
load_agent_executor,
load_chat_planner,
)
from langchain_openai import ChatOpenAI, OpenAI
### Set the OpenAI API Key
os.environ["OPENAI_API_KEY"] = "<OpenAI API Key>"
### Set Up Search and Math Chain Tools
search = DuckDuckGoSearchAPIWrapper()
llm = OpenAI(temperature=0)
llm_math_chain = LLMMathChain.from_llm(llm=llm, verbose=True)
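# LLMMathChain translates natural-language math questions into expressions and evaluates them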
tools = [
Tool(
name="Search",
func=search.run,
description="useful for when you need to answer questions about current events",
),
Tool(
name="Calculator",
func=llm_math_chain.run,
description="useful for when you need to answer questions about math",
),
]
### Initialize Planner and Executor
model = ChatOpenAI(model_name='gpt-4o-mini', temperature=0)
planner = load_chat_planner(model)
executor = load_agent_executor(model, tools, verbose=True)
agent = PlanAndExecute(planner=planner, executor=executor)
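# PlanAndExecute first asks the planner to draft a multi-step plan,
# then the executor carries out each step in turn using the tools above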
### Invoke the Agent
agent.invoke(
"Who is the founder of SpaceX an what is the square root of his year of birth?"
)
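For reference, invoke returns a dictionary rather than a bare string. A minimal way to capture just the final answer (assuming the default "output" key LangChain chains use) looks like this:
result = agent.invoke(
    "Who is the founder of SpaceX and what is the square root of his year of birth?"
)
print(result["output"])  # the final answer after planning and execution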
Web-Navigating: The Next Frontier
As agents grow in capability, they are also expanding into web navigation by leveraging the visual capabilities of language models.
Firstly, language models with vision capabilities significantly enhance AI agents by incorporating an additional modality, enabling them to process and understand visual information alongside text.
I've often considered what the most effective use cases for multi-modal models might be, and agentic applications that require visual input are a prime example.
Secondly, recent developments such as Apple's Ferret-UI, AppAgent v2 and the WebVoyager/LangChain implementation showcase how GUI elements can be mapped and defined using named bounding boxes, further advancing the integration of vision in agent-driven tasks.
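To make the bounding-box idea concrete, here is a minimal sketch of "set-of-marks" style annotation: drawing numbered boxes over GUI elements on a screenshot so a vision-capable model can refer to each element by label. This is not the Ferret-UI or WebVoyager code; the Pillow helper and the element coordinates are illustrative assumptions.
from PIL import Image, ImageDraw

def annotate_elements(screenshot_path, elements):
    # Draw a numbered bounding box over each interactive element
    image = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for idx, element in enumerate(elements):
        x1, y1, x2, y2 = element["bbox"]
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(idx), fill="red")
    return image

# Hypothetical elements, as a DOM or accessibility-tree crawl might return them
elements = [
    {"bbox": (10, 10, 120, 40), "role": "button", "name": "Search"},
    {"bbox": (10, 60, 300, 90), "role": "textbox", "name": "Query"},
]
annotate_elements("screenshot.png", elements).save("annotated.png")
The annotated screenshot, together with the element list, can then be passed to the vision model, which answers with an element number and an action to perform on it.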
WebPilot
The code will be publicly available at github.com/WebPilot.
In general, the initial goal was to enable agents to break down complex and ambiguous questions into smaller, manageable steps that can be solved sequentially, much like humans do.
This was followed by the development of independent tools that can be integrated to enhance the agent's capabilities. Each agent is identified by a description that outlines its specific abilities and functionalities.
WebPilot aims to extend the capabilities of agents by enabling them to explore the web via a web browser.
Currently, agents are expanding in two key areas. The first is the web, where agents explore pages through a browser and interpret their content.
The second is mobile operating systems, where agents are being developed to operate effectively.
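As a rough illustration of the first area, the snippet below sketches the observe-act primitive any browser-driving agent needs. It assumes Playwright as the browser layer, since WebPilot's own browser code has not been released:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    observation = page.content()        # raw HTML for the agent to interpret
    page.screenshot(path="step.png")    # screenshot for a vision-capable model
    # an agent would now choose its next action, e.g. page.click(...) or page.fill(...)
    browser.close()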
The image above illustrates how WebPilot takes the steps produced by its decomposition process and explores the web for answers.
To fully harness this potential, these agents must excel in tasks such as complex information retrieval, long-horizon task execution, and the integration of diverse information sources. (source: the WebPilot paper)
Planner, Controller, Extractor
Specifically, the Global Optimisation phase is driven by the Planner, Controller, Extractor and Verifier.
The Planner simplifies complex tasks by breaking them into smaller, manageable steps, helping to focus on specific actions and tackle the challenges of traditional MCTS (Monte Carlo Tree Search).
Reflective Task Adjustment (RTA) then fine-tunes the plan using new observations, enabling WebPilot to adapt as needed.
The Controller monitors subtask progress, evaluating completion and generating reflections if re-execution is needed, ensuring accurate and adaptive task completion.
Throughout this process, the Extractor collects essential information to aid in task execution. This coordinated approach ensures that WebPilot remains adaptable and efficient in dynamic environments.
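Since the code is not yet public, the sketch below is purely speculative: it shows one way the described roles could fit together in a loop. Every name and the control flow here are my own assumptions based on the paper's description, not WebPilot's actual implementation.
from dataclasses import dataclass, field

@dataclass
class WebPilotSketch:
    planner: callable     # task -> ordered list of sub-tasks
    executor: callable    # sub-task -> observation from the browser
    controller: callable  # (sub-task, observation) -> (done, reflection)
    extractor: callable   # observation -> information worth keeping
    memory: list = field(default_factory=list)

    def run(self, task, max_retries=3):
        for subtask in self.planner(task):  # the Planner decomposes the task
            for _ in range(max_retries):
                observation = self.executor(subtask)
                self.memory.append(self.extractor(observation))  # the Extractor gathers evidence
                done, reflection = self.controller(subtask, observation)  # the Controller judges progress
                if done:
                    break
                # Reflective Task Adjustment: retry the sub-task with the reflection folded in
                subtask = f"{subtask}\n(Reflection: {reflection})"
        return self.memory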
Conclusion
Although this study is at the forefront of agentic applications, the framework itself feels somewhat opaque to me, and I don't fully grasp all the concepts.
Once the code is released and working prototypes can be built, the approach and framework should become clearer.
I'm currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language, ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.