Internet Browsing AI Agents Demystified
To be truly effective, AI Agents need to start living in our environments, and our digital environments are the most obvious place to begin. For AI Agents to be of real help to us as humans and perform tasks on our behalf, they need to live and exist within our environment.
There have been a number of studies alluding to the idea that AI Agents will become the operating system of physical entities, or robots.
Hence the AI Agent will have an embodiment to perform tasks.
But before we get there, the most obvious place to begin implementing AI Agents and giving them a place to live is within digital environments. These digital environments can be customer support use-cases, but also personal digital environments.
So there have been a number of AI Agents that live on the user's desktop and are able to perform computer-use tasks, navigating the user's graphical user interface (GUI).
It should be noted that the accuracy of general, multipurpose AI Agents navigating a desktop/GUI is in the vicinity of 14% to 30%.
A slightly easier problem to solve is giving agents access to web browsing/search, where we have seen accuracy of 50% and more.
The most prudent approach is to give AI Agents access only to a curated list of websites. This protects the AI Agent from various attacks and pop-ups, as studies have shown that AI Agents are still very naive when it comes to navigating the web.
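Such a curated allow-list can be sketched in a few lines of plain Python. The domain list and function name below are illustrative only, not part of any particular framework:

```python
from urllib.parse import urlparse

# Illustrative allow-list; swap in whatever sites your agent should reach.
ALLOWED_DOMAINS = {"docs.python.org", "en.wikipedia.org"}

def is_allowed(url: str) -> bool:
    """Return True only when the URL's hostname is on the allow-list."""
    host = urlparse(url).hostname or ""
    # Match the domain exactly, or as the parent of a subdomain.
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
```

Checking the parsed hostname, rather than substring-matching the whole URL, avoids trivially bypassed filters such as a malicious site embedding an allowed domain in its query string.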
Browser Use Project
Browser-use is the easiest way to connect your AI Agents with the browser…
The screen recording below shows how the Browser Use Agent is asked a question, and it goes off, opens a browser and starts searching the web for an answer.
What I like about this open-source project is that you can install and run it on your local machine. Getting a prototype up and running lends a level of understanding which cannot be achieved by merely reading documentation.
If you are making use of a MacBook, the only tool you need is the Terminal application.
The next step is to create a virtual environment.
Creating a virtual environment is crucial for maintaining clean and isolated project dependencies, preventing conflicts between different projects that may require different versions of the same libraries.
By using a virtual environment, you can install project-specific packages without interfering with your global Python installation or other projects’ dependencies.
They provide a sandboxed space where you can experiment with different package versions and configurations without risking damage to your system-wide Python setup.
Additionally, virtual environments make it simple to manage and track exactly which packages and versions are required for a specific project, facilitating better dependency management and making your development workflow more organised and reproducible.
To create a virtual environment you do not need specialised software; with the command below, I created a virtual environment called browser.
python3 -m venv browser
Then, I activate the virtual environment.
source browser/bin/activate
As seen in the image below, when you activate a virtual environment, your command line prompt will typically change to show the name of the virtual environment in parentheses at the beginning of the prompt, indicating that you are currently working within that isolated Python environment.
This visual cue helps you quickly recognise that you are using a specific virtual environment, which means any Python or pip commands you run will be specific to this environment’s packages and dependencies.
The environment name usually appears before your standard command line username and machine information, making it immediately clear which virtual environment is currently active.
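If you ever doubt what the prompt is telling you, you can also confirm from inside Python whether a virtual environment is active. In a venv, `sys.prefix` is redirected to the environment directory while `sys.base_prefix` still points at the base installation:

```python
import sys

def in_virtualenv() -> bool:
    """True when the interpreter was started from a virtual environment:
    sys.prefix points at the venv, sys.base_prefix at the base install."""
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```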
Install the browser-use application…
pip install browser-use
Install playwright…
playwright install
Playwright is an open-source automation library developed by Microsoft that allows developers to write cross-browser web automation scripts for Chromium, Firefox, and WebKit using a single API.
(Chromium is an open-source web browser project initiated by Google that serves as the foundational source code for several popular web browsers.)
Create a text file named run.py…
vim run.py
Paste this code in the file…you will see that the text assigned to the task variable holds the instruction for the browsing agent.
from langchain_openai import ChatOpenAI
from browser_use import Agent
import asyncio
from dotenv import load_dotenv

load_dotenv()

async def main():
    agent = Agent(
        task="Compare the price of gpt-4o and DeepSeek-V3",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()

asyncio.run(main())
Give the file execution rights.
Keep in mind, the chmod 777 command grants full read, write and execute permissions to everyone (owner, group, and others) on a file or directory, which fundamentally breaks the principle of least privilege and can create massive security vulnerabilities by allowing any user on the system to modify, delete, or execute the file.
This blanket permission means that potentially malicious users or processes could compromise the file's integrity, potentially leading to system security breaches, so don’t do this on a server.
chmod 777 run.py
Create an environment file…
vim .env
Paste the line below in the file, with your OpenAI API key…
OPENAI_API_KEY=< Your OpenAI >
And then run your AI Agent…
python3 run.py
Below, the steps are shown as the AI Agent executes…this lends a good level of inspectability into the activity of the AI Agent.
You can see how the goals are defined and how the Agent moves from goal to goal…with actions, memory and evaluation of results.
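The goal → action → evaluate cycle visible in these logs can be sketched as a plain loop. The names and structure below are hypothetical, to illustrate the pattern, and do not reflect the actual browser-use internals:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    memory: list = field(default_factory=list)  # (action, result) history
    done: bool = False

def run_agent(state, choose_action, execute, evaluate, max_steps=10):
    """Hypothetical observe-act-evaluate loop: pick an action for the
    current goal, execute it, remember the result, and stop once the
    evaluator decides the goal has been met."""
    for _ in range(max_steps):
        action = choose_action(state)
        result = execute(action)
        state.memory.append((action, result))
        if evaluate(state, result):
            state.done = True
            break
    return state
```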
Below, the final step is shown and the result is reached: a summary of the information retrieved from the web, and a final answer.
What is interesting is that structured data is used to share information between the AI Agent and the LLM.
For a deeper insight into the data sent to and from the LLM, you can log into the OpenAI dashboard, where you can see the prompt sent to the LLM…
Your task is to extract the content of the page. You will be given a page and a goal and you should extract all relevant information around this goal from the page. If the goal is vague, summarize the page. Respond in json format. Extraction goal: extract the specific pricing details for DeepSeek-V3 from the search results, Page:
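The prompt above asks the LLM to respond in JSON; on the receiving side, that structured reply has to be parsed defensively, since models occasionally return malformed JSON. A minimal sketch with a hypothetical function, not browser-use's actual parser:

```python
import json

def parse_extraction(raw: str) -> dict:
    """Parse an LLM extraction reply, falling back gracefully
    when the payload is not valid JSON."""
    try:
        return {"ok": True, "content": json.loads(raw)}
    except json.JSONDecodeError:
        return {"ok": False, "content": raw}
```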
Click on an entry, and the Input and Output are shown: the input from the User, which in this case is the web-browsing AI Agent, and the response from the Assistant, which here is the LLM.
The Browser Use project is an innovative approach to understanding and optimising web browser interactions through comprehensive automation and testing frameworks.
By providing builders with a robust toolkit for cross-browser testing and interaction simulation, the project aims to simplify the complex landscape of web application testing across different browser environments.
It connects AI Agents directly to web interfaces, enabling sophisticated automated interactions that could revolutionise how artificial intelligence navigates and understands digital environments.
In doing so, it forms a critical link between AI Agents and web-based information, allowing intelligent agents to perform complex web tasks, gather data, and interact with online platforms.
Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.