
Detecting AI traffic on your website with crawler identification

Every day, ChatGPT and Perplexity send you invisible traffic in the form of crawlers going through your site to digest information on behalf of users in order to answer their queries. Most website owners don't even know how much traffic they get from AI crawlers, but you can actually count it, and optimize for it, with just a few simple tricks.

Identification method

Request Headers

Long ago, before AI was even a thing, both Google and Bing agreed that it would be good practice to disclose when they visit your website through crawlers. To do this, they started publishing a list of ways to identify themselves.

The first parameter that makes a crawler identifiable is the User-Agent header, a string that identifies the browser or crawler visiting your website.

This string can be freely configured when making a request; Google, OpenAI, Perplexity and Bing all publish their user-agent configurations.
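
As a concrete illustration, here is a minimal TypeScript sketch of this first check: testing an incoming User-Agent against the crawler tokens these providers document. The token list is illustrative and should be kept in sync with each provider's official documentation.

```ts
// Minimal User-Agent check against known AI crawler tokens.
// The token list is illustrative; cross-check the providers'
// published documentation for the current strings.
const AI_CRAWLER_TOKENS = [
  "OAI-SearchBot",
  "ChatGPT-User",
  "GPTBot",
  "PerplexityBot",
  "Perplexity-User",
  "bingbot",
  "Googlebot",
];

export function matchAiCrawler(userAgent: string | undefined): string | null {
  if (!userAgent) return null;
  const lower = userAgent.toLowerCase();
  const hit = AI_CRAWLER_TOKENS.find((t) => lower.includes(t.toLowerCase()));
  return hit ?? null;
}

// Example:
// matchAiCrawler("Mozilla/5.0 ... GPTBot/1.1; +https://openai.com/gptbot") === "GPTBot"
```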

IP ranges

User-agents can be freely configured, so someone could impersonate a company simply by setting their user-agent value to one of the big companies' strings.

That's why, in addition to user-agents, these companies also publish dynamic lists of the IP ranges from which their crawlers send requests. There are multiple lists depending on the kind of crawler, and each one contains IPv4 or IPv6 ranges expressed as a base IP plus a subnet mask (CIDR notation).
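
To make the check concrete, here is a hedged TypeScript sketch that fetches one of these range lists and tests an IPv4 address against its CIDR entries. It assumes the JSON shape `{ "prefixes": [{ "ipv4Prefix": "a.b.c.d/nn" }, …] }` that Google and OpenAI use for their lists, and handles IPv4 only for brevity.

```ts
// Sketch: verify an IPv4 address against a provider's published range list.
// Assumes the JSON shape { prefixes: [{ ipv4Prefix: "a.b.c.d/nn" }, ...] };
// adapt the parsing to the provider you are checking.
type RangeList = { prefixes: { ipv4Prefix?: string; ipv6Prefix?: string }[] };

function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bits] = cidr.split("/");
  const mask = bits === "0" ? 0 : (~0 << (32 - Number(bits))) >>> 0;
  return (ipv4ToInt(ip) & mask) === (ipv4ToInt(base) & mask);
}

export async function isInRangeList(ip: string, listUrl: string): Promise<boolean> {
  const list: RangeList = await (await fetch(listUrl)).json();
  return list.prefixes.some((p) => p.ipv4Prefix && inCidr(ip, p.ipv4Prefix));
}

// Hypothetical usage (take the real list URL from the provider's docs):
// await isInRangeList("20.171.206.10", "https://openai.com/searchbot.json");
```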

One way to get the IP address of the request sender, even behind a reverse proxy, is the X-Forwarded-For header: it contains the list of IP addresses the request has passed through, the first one being the original sender's address.
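
A minimal sketch of that extraction, assuming your reverse proxy sets or overwrites the header (otherwise the value can be spoofed by the client):

```ts
// Resolve the original client IP behind a reverse proxy.
// Only trust X-Forwarded-For if your proxy strips or overwrites it;
// otherwise a client can inject an arbitrary address.
export function clientIpFrom(headers: Headers): string | null {
  const forwarded = headers.get("x-forwarded-for");
  if (!forwarded) return null;
  // "client, proxy1, proxy2" -> take the first (leftmost) entry
  return forwarded.split(",")[0].trim();
}
```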

Providers

Here is a compiled list of the main user-agents and IP range lists per provider.

| Provider | User-Agent(s) | IP Range List(s) | Purpose |
| --- | --- | --- | --- |
| OpenAI | OAI-SearchBot/1.0; +https://openai.com/searchbot | searchbot.json | Crawler in charge of the web search tool on ChatGPT & SearchGPT |
| OpenAI | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | chatgpt-user.json | Crawler for custom GPTs' interactions with the web |
| OpenAI | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot | gptbot.json | Crawler that indexes your website to gather data for AI model training |
| Bing | {browser string}; bingbot/2.0; +http://www.bing.com/bingbot.htm | bingbot.json | General purpose Bing crawler |
| Bing | {browser string}; adidxbot/2.0; +http://www.bing.com/bingbot.htm | bingbot.json | Bing Ads crawler |
| Bing | {browser string}; MicrosoftPreview/2.0; +https://aka.ms/MicrosoftPreview | bingbot.json | Microsoft preview crawler that generates snapshots of websites on both desktop & mobile |
| Perplexity | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | perplexitybot.json | Crawler that surfaces websites in Perplexity search |
| Perplexity | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user) | perplexity-user.json | Crawler activated on the user's behalf for search purposes |
| Google | compatible; Googlebot/2.1; +http://www.google.com/bot.html | googlebot.json, special-crawlers.json | The Google Search general purpose crawler |
| Google | Google-Extended (uses other Google user-agents) | user-triggered-fetchers.json, user-triggered-fetchers-google.json | Special purpose crawler token for Gemini AI and other AI content on Google |
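
For use in code, the table can be condensed into a small lookup structure. The range-list filenames below are the ones documented above; resolve them to full URLs from each provider's official documentation, since those can change over time.

```ts
// The providers table as a lookup structure for detection code.
export const AI_PROVIDERS = [
  { provider: "OpenAI",     uaToken: "OAI-SearchBot",    rangeList: "searchbot.json" },
  { provider: "OpenAI",     uaToken: "ChatGPT-User",     rangeList: "chatgpt-user.json" },
  { provider: "OpenAI",     uaToken: "GPTBot",           rangeList: "gptbot.json" },
  { provider: "Bing",       uaToken: "bingbot",          rangeList: "bingbot.json" },
  { provider: "Bing",       uaToken: "adidxbot",         rangeList: "bingbot.json" },
  { provider: "Bing",       uaToken: "MicrosoftPreview", rangeList: "bingbot.json" },
  { provider: "Perplexity", uaToken: "PerplexityBot",    rangeList: "perplexitybot.json" },
  { provider: "Perplexity", uaToken: "Perplexity-User",  rangeList: "perplexity-user.json" },
  { provider: "Google",     uaToken: "Googlebot",        rangeList: "googlebot.json" },
] as const;
```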

Analysis

Logs

One approach is to run your analysis from logs, probably the most convenient way for websites hosted on cloud providers. Here the idea is to process the logs on a daily or weekly basis, so you get delayed data out of it.
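
As a sketch of this offline approach, the following script counts hits per crawler token in an access log. It assumes a combined-format log (such as nginx's default) where the User-Agent string appears verbatim on each line.

```ts
// Offline log analysis sketch: count hits per AI crawler token in an access log.
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

const TOKENS = [
  "GPTBot", "ChatGPT-User", "OAI-SearchBot",
  "PerplexityBot", "Perplexity-User", "bingbot", "Googlebot",
];

async function countCrawlerHits(logPath: string): Promise<Record<string, number>> {
  const counts: Record<string, number> = {};
  const lines = createInterface({ input: createReadStream(logPath) });
  for await (const line of lines) {
    const token = TOKENS.find((t) => line.includes(t));
    if (token) counts[token] = (counts[token] ?? 0) + 1;
  }
  return counts;
}

// Example: countCrawlerHits("/var/log/nginx/access.log").then(console.log);
```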

Middleware

Another approach is to include a small middleware within your server that does this pattern matching over IPs and user-agents and logs everything in real time. This is what Doppler enables you to do thanks to its lightweight SDK, which you can plug into a server-side web framework like Nuxt.js or Next.js.
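
As an illustration of the pattern (not the Doppler SDK itself), here is a minimal Next.js middleware sketch that tags AI crawler hits as they come in. Swap the `console.log` sink for your own logging or analytics pipeline, and ideally verify the client IP against the provider's range list as shown earlier.

```ts
// middleware.ts — illustrative Next.js middleware showing the realtime pattern:
// inspect each request's User-Agent and client IP, tag AI crawler hits,
// and forward them to your logging sink.
import { NextRequest, NextResponse } from "next/server";

const AI_TOKENS = [
  "GPTBot", "ChatGPT-User", "OAI-SearchBot",
  "PerplexityBot", "Perplexity-User", "bingbot", "Googlebot",
];

export function middleware(request: NextRequest) {
  const ua = request.headers.get("user-agent") ?? "";
  const token = AI_TOKENS.find((t) => ua.includes(t));
  if (token) {
    const ip = request.headers.get("x-forwarded-for")?.split(",")[0]?.trim();
    // Replace console.log with your own sink; ideally also verify `ip`
    // against the provider's published IP range list.
    console.log(JSON.stringify({
      crawler: token,
      ip,
      path: request.nextUrl.pathname,
      at: new Date().toISOString(),
    }));
  }
  return NextResponse.next();
}
```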

Compatible robots.txt

If you don't want to cut your website off from AI traffic, your main concern should be making sure your robots.txt file does not disallow any of these user-agents, at least not on the public pages that you want indexed or used in AI answers.
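
For example, a robots.txt along these lines keeps a private area (placeholder path) blocked while leaving public pages open to the crawlers listed above. Note that a crawler follows the most specific group that names it, so repeat the Disallow for named agents rather than relying on the `*` group alone.

```
# Example robots.txt sketch — paths are placeholders for your own site.
# AI crawlers stay allowed everywhere except the private area.
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
User-agent: Perplexity-User
Disallow: /private/

# Everyone else follows the same rule.
User-agent: *
Disallow: /private/
```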