
Detecting AI traffic on your website with crawler identification

Every day, ChatGPT and Perplexity send you invisible traffic in the form of crawlers going through your site to digest information on behalf of users in order to answer their queries. Most website owners don't even know how much traffic they get from AI crawlers, but you can actually count it, and optimize for it, with just a few simple tricks.

Identification method

Request Headers

Long ago, before AI was even a thing, both Google and Bing agreed that it would be good practice to disclose when they visit your website through crawlers. To do this, they started publishing a list of ways to identify themselves.

The first parameter that makes a crawler identifiable is the User-Agent header, a string that identifies the browser or crawler visiting your website.

This string can be freely configured when making a request; Google, OpenAI, Perplexity and Bing all publish their user-agent configurations.
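
As a concrete illustration, here is a minimal TypeScript sketch of this first check: testing an incoming User-Agent against the crawler tokens these providers document. The token list is illustrative and should be kept in sync with each provider's official documentation.

```ts
// Minimal User-Agent check against known AI crawler tokens.
// The token list is illustrative; cross-check the providers'
// published documentation for the current strings.
const AI_CRAWLER_TOKENS = [
  "OAI-SearchBot",
  "ChatGPT-User",
  "GPTBot",
  "PerplexityBot",
  "Perplexity-User",
  "bingbot",
  "Googlebot",
];

export function matchAiCrawler(userAgent: string | undefined): string | null {
  if (!userAgent) return null;
  const lower = userAgent.toLowerCase();
  const hit = AI_CRAWLER_TOKENS.find((t) => lower.includes(t.toLowerCase()));
  return hit ?? null;
}

// Example:
// matchAiCrawler("Mozilla/5.0 ... GPTBot/1.1; +https://openai.com/gptbot") === "GPTBot"
```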

IP ranges

User-agents can be freely configured, so someone could impersonate a company simply by setting their user-agent value to one of the big companies' strings.

That's why, in addition to user-agents, these companies also publish dynamic lists of the IP ranges from which their crawlers send requests. There are multiple lists depending on the kind of crawler, and each one contains IPv4 or IPv6 ranges expressed as a base IP plus a subnet mask (CIDR notation).
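
To make the check concrete, here is a hedged TypeScript sketch that fetches one of these range lists and tests an IPv4 address against its CIDR entries. It assumes the JSON shape `{ "prefixes": [{ "ipv4Prefix": "a.b.c.d/nn" }, …] }` that Google and OpenAI use for their lists, and handles IPv4 only for brevity.

```ts
// Sketch: verify an IPv4 address against a provider's published range list.
// Assumes the JSON shape { prefixes: [{ ipv4Prefix: "a.b.c.d/nn" }, ...] };
// adapt the parsing to the provider you are checking.
type RangeList = { prefixes: { ipv4Prefix?: string; ipv6Prefix?: string }[] };

function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bits] = cidr.split("/");
  const mask = bits === "0" ? 0 : (~0 << (32 - Number(bits))) >>> 0;
  return (ipv4ToInt(ip) & mask) === (ipv4ToInt(base) & mask);
}

export async function isInRangeList(ip: string, listUrl: string): Promise<boolean> {
  const list: RangeList = await (await fetch(listUrl)).json();
  return list.prefixes.some((p) => p.ipv4Prefix && inCidr(ip, p.ipv4Prefix));
}

// Hypothetical usage (take the real list URL from the provider's docs):
// await isInRangeList("20.171.206.10", "https://openai.com/searchbot.json");
```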

One way to get the IP address of the request sender, even behind a reverse proxy, is the X-Forwarded-For header: it contains the list of IP addresses the request has passed through, the first one being the original sender's address.
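
A minimal sketch of that extraction, assuming your reverse proxy sets or overwrites the header (otherwise the value can be spoofed by the client):

```ts
// Resolve the original client IP behind a reverse proxy.
// Only trust X-Forwarded-For if your proxy strips or overwrites it;
// otherwise a client can inject an arbitrary address.
export function clientIpFrom(headers: Headers): string | null {
  const forwarded = headers.get("x-forwarded-for");
  if (!forwarded) return null;
  // "client, proxy1, proxy2" -> take the first (leftmost) entry
  return forwarded.split(",")[0].trim();
}
```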

Providers

Here is a compiled list of the main user-agents and IP range lists per provider.

| Provider | User-Agent(s) | IP Range List(s) | Purpose |
| --- | --- | --- | --- |
| OpenAI | OAI-SearchBot/1.0; +https://openai.com/searchbot | searchbot.json | Crawler in charge of the web search tool on ChatGPT & SearchGPT |
| OpenAI | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | chatgpt-user.json | Crawler for custom GPTs' interactions with the web |
| OpenAI | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot | gptbot.json | Crawler that indexes your website to gather data for AI model training |
| Bing | {browser string}; bingbot/2.0; +http://www.bing.com/bingbot.htm | bingbot.json | General purpose Bing crawler |
| Bing | {browser string}; adidxbot/2.0; +http://www.bing.com/bingbot.htm | bingbot.json | Bing Ads crawler |
| Bing | {browser string}; MicrosoftPreview/2.0; +https://aka.ms/MicrosoftPreview | bingbot.json | Microsoft preview crawler that generates snapshots of websites on both desktop & mobile |
| Perplexity | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | perplexitybot.json | Crawler that surfaces websites in Perplexity search |
| Perplexity | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user) | perplexity-user.json | Crawler activated on the user's behalf for search purposes |
| Google | compatible; Googlebot/2.1; +http://www.google.com/bot.html | googlebot.json, special-crawlers.json | The Google Search general purpose crawler |
| Google | Google-Extended (uses other Google user-agents) | user-triggered-fetchers.json, user-triggered-fetchers-google.json | Special purpose crawler token for Gemini AI and other AI content on Google |
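
For use in code, the table can be condensed into a small lookup structure. The range-list filenames below are the ones documented above; resolve them to full URLs from each provider's official documentation, since those can change over time.

```ts
// The providers table as a lookup structure for detection code.
export const AI_PROVIDERS = [
  { provider: "OpenAI",     uaToken: "OAI-SearchBot",    rangeList: "searchbot.json" },
  { provider: "OpenAI",     uaToken: "ChatGPT-User",     rangeList: "chatgpt-user.json" },
  { provider: "OpenAI",     uaToken: "GPTBot",           rangeList: "gptbot.json" },
  { provider: "Bing",       uaToken: "bingbot",          rangeList: "bingbot.json" },
  { provider: "Bing",       uaToken: "adidxbot",         rangeList: "bingbot.json" },
  { provider: "Bing",       uaToken: "MicrosoftPreview", rangeList: "bingbot.json" },
  { provider: "Perplexity", uaToken: "PerplexityBot",    rangeList: "perplexitybot.json" },
  { provider: "Perplexity", uaToken: "Perplexity-User",  rangeList: "perplexity-user.json" },
  { provider: "Google",     uaToken: "Googlebot",        rangeList: "googlebot.json" },
] as const;
```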

Analysis

Logs

One approach is to run your analysis from logs, probably the most convenient way for websites hosted on cloud providers. Here the idea is to process the logs on a daily or weekly basis, so you get delayed data out of it.
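
As a sketch of this offline approach, the following script counts hits per crawler token in an access log. It assumes a combined-format log (such as nginx's default) where the User-Agent string appears verbatim on each line.

```ts
// Offline log analysis sketch: count hits per AI crawler token in an access log.
import { createReadStream } from "node:fs";
import { createInterface } from "node:readline";

const TOKENS = [
  "GPTBot", "ChatGPT-User", "OAI-SearchBot",
  "PerplexityBot", "Perplexity-User", "bingbot", "Googlebot",
];

async function countCrawlerHits(logPath: string): Promise<Record<string, number>> {
  const counts: Record<string, number> = {};
  const lines = createInterface({ input: createReadStream(logPath) });
  for await (const line of lines) {
    const token = TOKENS.find((t) => line.includes(t));
    if (token) counts[token] = (counts[token] ?? 0) + 1;
  }
  return counts;
}

// Example: countCrawlerHits("/var/log/nginx/access.log").then(console.log);
```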

Middleware

Another approach is to include a small middleware within your server that does this pattern matching over IPs and user-agents and logs everything in real time. This is what Doppler enables you to do thanks to its lightweight SDK, which you can plug into a server-side web framework like Nuxt.js or Next.js.
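
As an illustration of the pattern (not the Doppler SDK itself), here is a minimal Next.js middleware sketch that tags AI crawler hits as they come in. Swap the `console.log` sink for your own logging or analytics pipeline, and ideally verify the client IP against the provider's range list as shown earlier.

```ts
// middleware.ts — illustrative Next.js middleware showing the realtime pattern:
// inspect each request's User-Agent and client IP, tag AI crawler hits,
// and forward them to your logging sink.
import { NextRequest, NextResponse } from "next/server";

const AI_TOKENS = [
  "GPTBot", "ChatGPT-User", "OAI-SearchBot",
  "PerplexityBot", "Perplexity-User", "bingbot", "Googlebot",
];

export function middleware(request: NextRequest) {
  const ua = request.headers.get("user-agent") ?? "";
  const token = AI_TOKENS.find((t) => ua.includes(t));
  if (token) {
    const ip = request.headers.get("x-forwarded-for")?.split(",")[0]?.trim();
    // Replace console.log with your own sink; ideally also verify `ip`
    // against the provider's published IP range list.
    console.log(JSON.stringify({
      crawler: token,
      ip,
      path: request.nextUrl.pathname,
      at: new Date().toISOString(),
    }));
  }
  return NextResponse.next();
}
```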

Compatible robots.txt

If you don't want to cut your website off from AI traffic, your main concern should be making sure your robots.txt file does not disallow any of these user-agents, at least not on the public pages that you want indexed or used in AI answers.
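
For example, a robots.txt along these lines keeps a private area (placeholder path) blocked while leaving public pages open to the crawlers listed above. Note that a crawler follows the most specific group that names it, so repeat the Disallow for named agents rather than relying on the `*` group alone.

```
# Example robots.txt sketch — paths are placeholders for your own site.
# AI crawlers stay allowed everywhere except the private area.
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
User-agent: Perplexity-User
Disallow: /private/

# Everyone else follows the same rule.
User-agent: *
Disallow: /private/
```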