GPTBot

OpenAI’s Web Crawler

Cobus Greyling
3 min read · Aug 9, 2023


I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

There is an uproar on the internet about GPTBot and how it will be used to train GPT-5.

However, crawlers are almost as old as the web itself, with companies like Google using them to perform actions for their products automatically.

"Crawler" and "bot" are generic terms for an automated process that discovers and scans websites by following links from one page to the next.
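
To make the discover-and-follow-links idea concrete, here is a minimal sketch in Python. It is not how GPTBot or any production crawler is implemented; the three-page site in PAGES and the names LinkCollector and crawl are hypothetical, and the pages live in memory so nothing is fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical three-page site held in memory (an assumption for illustration).
PAGES = {
    "/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/">Home</a>',
    "/blog": '<a href="/about">About</a>',
}


def crawl(start):
    """Visit pages breadth-first by following links; return the visit order."""
    seen = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        parser = LinkCollector()
        parser.feed(PAGES.get(page, ""))
        for link in parser.links:
            if link not in seen:  # skip pages that were already discovered
                seen.append(link)
                queue.append(link)
    return seen


print(crawl("/"))  # ['/', '/about', '/blog']
```

A real crawler would add fetching, politeness delays and robots.txt checks on top of this loop.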

Google uses crawlers and fetchers to perform actions for its products, either automatically or triggered by user request.

Bots have therefore been used extensively by various companies; should OpenAI's transparency and the ability to opt out not be lauded?

GPTBot, OpenAI's web crawler, can be identified by its user agent token and full user-agent string:

User agent token: GPTBot

Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
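
As a rough sketch of how a site owner might spot GPTBot in server-side code, the token can be matched against the User-Agent request header. The function name is_gptbot is hypothetical, and because any client can send an arbitrary User-Agent, a match should be treated as a hint rather than proof of origin.

```python
GPTBOT_TOKEN = "GPTBot"


def is_gptbot(user_agent):
    """Return True if the User-Agent header value contains the GPTBot token."""
    # Case-insensitive substring match; an empty or missing header never matches.
    return GPTBOT_TOKEN.lower() in (user_agent or "").lower()


gptbot_ua = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)

print(is_gptbot(gptbot_ua))                          # True
print(is_gptbot("Mozilla/5.0 (X11; Linux x86_64)"))  # False
```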

Disallowing GPTBot

To block GPTBot from accessing a website, add the following to the site's robots.txt:

User-agent: GPTBot
Disallow: /
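
One way to sanity-check a rule like this before deploying it is Python's standard-library urllib.robotparser. The sketch below parses the rule from a string instead of fetching it; the helper can_crawl, the agent name SomeOtherBot and the example.com URL are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def can_crawl(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits the given agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# GPTBot is blocked everywhere; agents with no matching rule stay allowed.
print(can_crawl("GPTBot", "https://example.com/any-page"))        # False
print(can_crawl("SomeOtherBot", "https://example.com/any-page"))  # True
```

Note that robots.txt is a request, not an enforcement mechanism: it only works for crawlers that choose to honour it.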

GPTBot Customised Access

To direct GPTBot to access only parts of a site, add the GPTBot token to the site's robots.txt like this:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
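
The effect of these rules can be checked with the same standard-library parser. This is a sketch under assumptions: the helper gptbot_allowed, the example.com host and the page names are made up for illustration. It also shows that a path matching neither rule remains crawlable by default, so restricting GPTBot to directory-1 alone would need a broader Disallow rule as well.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_allowed(path, robots_txt=ROBOTS_TXT):
    """Return True if GPTBot may fetch the given path on a placeholder host."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", "https://example.com" + path)


print(gptbot_allowed("/directory-1/page.html"))  # True
print(gptbot_allowed("/directory-2/page.html"))  # False
print(gptbot_allowed("/elsewhere/page.html"))    # True (no rule matches)
```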

Written by Cobus Greyling

I’m passionate about exploring the intersection of AI & language. www.cobusgreyling.com
