Detecting AI traffic on your website with crawler identification
Every day, ChatGPT and Perplexity send you invisible traffic in the form of crawlers that go through your site to digest information on behalf of users in order to answer their queries. Most website owners don't even know how much traffic they get from AI crawlers, but you can actually count it and optimize for it with just a few simple tricks.
Identification methods
Request Headers
Long before AI was even a thing, Google and Bing agreed that it would be good practice to disclose when they visit your website through crawlers. To do this, they started publishing lists of ways to identify them.
The first parameter that makes a crawler identifiable is the User-Agent
header, a string that identifies the browser or crawler that is visiting your website.
This string can be freely configured when making a request; Google, OpenAI, Perplexity, and Bing all publish their user-agent configurations.
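As a starting point, matching the User-Agent against the published crawler names is a simple substring check. Here is a minimal sketch in TypeScript; the token list below is illustrative (it mirrors the provider table further down), so check each provider's documentation for the authoritative, up-to-date strings.

```ts
// Minimal sketch: match a request's User-Agent against known AI crawler tokens.
// The token list is illustrative; consult each provider's docs for current values.
const AI_CRAWLER_TOKENS = [
  "OAI-SearchBot",
  "ChatGPT-User",
  "GPTBot",
  "PerplexityBot",
  "Perplexity-User",
  "bingbot",
  "Googlebot",
];

function detectAiCrawler(userAgent: string | null): string | null {
  if (!userAgent) return null;
  const lowered = userAgent.toLowerCase();
  const match = AI_CRAWLER_TOKENS.find((token) => lowered.includes(token.toLowerCase()));
  return match ?? null;
}

// Example: detectAiCrawler("Mozilla/5.0 AppleWebKit/537.36 ...; GPTBot/1.1; ...") -> "GPTBot"
```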
IP ranges
Because user-agents can be freely configured, someone could impersonate one of these companies simply by setting their user-agent value to one of the published strings.
That's why, in addition to user-agents, they also publish dynamic lists of IP ranges from which their crawlers send requests. There are multiple lists based on the different kinds of crawlers, and each one contains IPv4 or IPv6 ranges made of a base IP and a subnet mask.
One way to get the IP address of the request sender, even behind a reverse proxy, is to use the X-Forwarded-For
header, which contains the list of IP addresses the request has passed through, the first one representing the original sender's address.
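Putting the two together, here is a sketch that takes the left-most X-Forwarded-For entry and checks it against the CIDR ranges from one of the published JSON files. The JSON shape (`prefixes[].ipv4Prefix`) mirrors what these providers publish at the time of writing, but treat it as an assumption and adapt it to the actual file; only IPv4 is handled for brevity.

```ts
// Sketch: verify that the original client IP (left-most X-Forwarded-For entry)
// falls inside one of the CIDR ranges published by a provider.
// Assumes a JSON shape of { prefixes: [{ ipv4Prefix: "a.b.c.d/nn" }, ...] }.

function clientIpFromForwardedFor(header: string | null): string | null {
  if (!header) return null;
  return header.split(",")[0]?.trim() ?? null;
}

function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => ((acc << 8) | parseInt(octet, 10)) >>> 0, 0);
}

function ipInCidr(ip: string, cidr: string): boolean {
  const [base, bits] = cidr.split("/");
  const mask = Number(bits) === 0 ? 0 : (~0 << (32 - Number(bits))) >>> 0;
  return (ipv4ToInt(ip) & mask) === (ipv4ToInt(base) & mask);
}

async function isVerifiedCrawlerIp(ip: string, rangeListUrl: string): Promise<boolean> {
  // rangeListUrl is the URL of a published range list (e.g. the searchbot.json
  // file from the provider table below); get the exact URL from the provider's docs.
  const res = await fetch(rangeListUrl);
  const { prefixes } = (await res.json()) as { prefixes: { ipv4Prefix?: string }[] };
  return prefixes.some((p) => p.ipv4Prefix !== undefined && ipInCidr(ip, p.ipv4Prefix));
}

// Usage:
// const ip = clientIpFromForwardedFor(request.headers.get("x-forwarded-for"));
// if (ip) await isVerifiedCrawlerIp(ip, rangeListUrl);
```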
Providers
Here is a compiled list of the main user-agents and IP range lists per provider:
| Provider | User-Agent(s) | IP Range List(s) | Purpose |
|---|---|---|---|
| OpenAI | OAI-SearchBot/1.0; +https://openai.com/searchbot | searchbot.json | Crawler in charge of the web search tool on ChatGPT & SearchGPT |
| | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | chatgpt-user.json | Crawler for custom GPTs interactions with the web |
| | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot | gptbot.json | Crawler that indexes your website to gather data for AI model training |
| Bing | {browser string}; bingbot/2.0; +http://www.bing.com/bingbot.htm | bingbot.json | General purpose Bing crawler |
| | {browser string}; adidxbot/2.0; +http://www.bing.com/bingbot.htm | bingbot.json | Bing Ads crawler |
| | {browser string}; MicrosoftPreview/2.0; +https://aka.ms/MicrosoftPreview | bingbot.json | Microsoft preview crawler that generates snapshots of websites on both desktop & mobile |
| Perplexity | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | perplexitybot.json | Crawler that shows websites in the search |
| | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user) | perplexity-user.json | Crawler activated on the user's behalf for search purposes |
| Google | compatible; Googlebot/2.1; +http://www.google.com/bot.html | googlebot.json, special-crawlers.json | The Google Search general purpose crawler |
| | Google-Extended (uses other Google user-agents) | user-triggered-fetchers.json, user-triggered-fetchers-google.json | Special purpose crawler for Gemini AI and other AI content on Google |
Analysis
Logs
One approach is to run your analysis from logs, probably the most convenient option for websites hosted on cloud providers. Here you process the logs on a daily or weekly basis, which means the data you get is delayed; a sketch of such a pass is shown below.
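This sketch counts AI crawler hits in an access log. It assumes a standard "combined" log format (as nginx produces by default), where the User-Agent is the last double-quoted field; the file path and token list are illustrative.

```ts
// Sketch: count AI crawler hits in an access log ("combined" format,
// where the User-Agent is the last double-quoted field).
import { readFileSync } from "node:fs";

const TOKENS = ["OAI-SearchBot", "ChatGPT-User", "GPTBot", "PerplexityBot", "Perplexity-User", "bingbot", "Googlebot"];

const counts = new Map<string, number>();
for (const line of readFileSync("/var/log/nginx/access.log", "utf8").split("\n")) {
  const parts = line.split('"');
  const ua = parts.length >= 2 ? parts[parts.length - 2] : ""; // last quoted field = User-Agent
  const token = TOKENS.find((t) => ua.includes(t));
  if (token) counts.set(token, (counts.get(token) ?? 0) + 1);
}
console.table(Object.fromEntries(counts));
```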
Middleware
Another approach is to include a small middleware within your server that does this pattern matching over IPs and user-agents and logs everything in real time. This is what Doppler enables you to do thanks to its lightweight SDK that you can plug into a server-side web framework like Nuxt.js or Next.js.
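To make the idea concrete, here is a minimal hand-rolled sketch of such a middleware in a Next.js `middleware.ts`. It is not the Doppler SDK; the token list and the logging destination are placeholders you would replace with your own.

```ts
// middleware.ts — minimal sketch of real-time crawler logging in Next.js.
// Hand-rolled illustration, not the Doppler SDK; adapt the sink to your needs.
import { NextResponse, type NextRequest } from "next/server";

const AI_CRAWLER_TOKENS = ["OAI-SearchBot", "ChatGPT-User", "GPTBot", "PerplexityBot", "Perplexity-User", "bingbot", "Googlebot"];

export function middleware(request: NextRequest) {
  const ua = request.headers.get("user-agent") ?? "";
  const token = AI_CRAWLER_TOKENS.find((t) => ua.includes(t));
  if (token) {
    const ip = request.headers.get("x-forwarded-for")?.split(",")[0]?.trim();
    // Replace console.log with your own sink (database, analytics endpoint, queue, ...).
    console.log(JSON.stringify({ crawler: token, ip, path: request.nextUrl.pathname, at: new Date().toISOString() }));
  }
  return NextResponse.next();
}
```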
Compatible robots.txt
If you don't want to cut your website off from AI traffic, your main concern should be to make sure your robots.txt
file does not disallow any of these user-agents, at least not on public pages that you want indexed or used in AI responses.
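For example, a robots.txt that explicitly allows the crawlers listed above on public pages could look like the following; this is only an illustration, and the disallowed path is a placeholder you would adapt to your own site.

```txt
# Allow the main AI and search crawlers on public pages
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: GPTBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: bingbot
User-agent: Googlebot
User-agent: Google-Extended
Allow: /
Disallow: /admin/
```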