Should you block AI crawlers? GPTBot, ClaudeBot, and PerplexityBot explained (2026)

Most businesses should not block AI crawlers, because blocking the wrong bot removes you from AI search answers entirely, and the answers are where buyers now start. The nuanced version: training crawlers and search crawlers are different bots with different jobs, and the 2026 consensus posture is to allow search crawlers and decide deliberately on training crawlers. Publishers monetizing content licensing have a real case for blocking training bots; businesses that want customers do not. This guide names each crawler, shows what the blocking data actually says, and gives you the decision framework plus the audit that catches accidental blocks.

What do AI crawlers actually do on your site?

AI crawlers do three distinct jobs, and conflating them is the root of most bad blocking decisions. Training crawlers (GPTBot, ClaudeBot, CCBot, Meta-ExternalAgent) collect content to train future models; their effect on your visibility is long-term and diffuse. Search index crawlers (OAI-SearchBot, Claude-SearchBot) build the retrieval indexes that power live AI search answers; blocking these removes you from citations directly. User-action fetchers (ChatGPT-User, Perplexity-User) retrieve a specific page in real time because a user’s conversation asked for it; blocking these breaks your site’s presence at the exact moment a prospect is engaging.
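
The three roles can be written down as a small lookup, which helps when sorting log entries or firewall rules by crawler type. What follows is a minimal Python sketch, assuming simple substring matching on the bot tokens named above. The `CRAWLER_ROLES` table and `classify_crawler` helper are illustrative names, and the sample User-Agent strings are simplified, not the vendors' exact headers.

```python
# Roles as grouped in the paragraph above; the table layout is illustrative.
CRAWLER_ROLES = {
    "training": ["GPTBot", "ClaudeBot", "CCBot", "Meta-ExternalAgent"],
    "search_index": ["OAI-SearchBot", "Claude-SearchBot"],
    "user_action": ["ChatGPT-User", "Perplexity-User"],
}


def classify_crawler(user_agent: str) -> str:
    """Return the role of the first known bot token found in a User-Agent header."""
    ua = user_agent.lower()
    for role, tokens in CRAWLER_ROLES.items():
        for token in tokens:
            if token.lower() in ua:
                return role
    return "unknown"


# Simplified, hypothetical User-Agent strings for illustration only.
for sample in ("Mozilla/5.0 (compatible; GPTBot/1.0)", "OAI-SearchBot/1.0", "ChatGPT-User/1.0"):
    print(sample, "->", classify_crawler(sample))
```

Keeping the roles in one table means a blocking decision can be made per role rather than per bot, which is the distinction the rest of this guide relies on.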

The split is recent and deliberate. OpenAI separated GPTBot (training) from OAI-SearchBot (search), and Anthropic runs ClaudeBot for training alongside Claude-SearchBot for retrieval, per Anagram’s 2026 crawler guide. That separation means you can block model training while staying fully visible in AI search. Sites still running 2024-era blanket blocks are opting out of AI search by default, a choice they never consciously made.

What percentage of sites block AI crawlers in 2026?

Blocking is common among publishers and rare everywhere else, and the trend just reversed for the most-blocked bot. GPTBot is the most-blocked AI crawler at 5.52 percent of DISALLOW rules across Cloudflare’s network in Q1 2026, ahead of CCBot at 5.08 percent, ClaudeBot at 4.88 percent, and Google-Extended at 4.44 percent, per Technology Checker’s robots.txt analysis. Among news publishers, 79 percent block at least one AI training bot, which makes sense: their content is the licensing asset.

The reversal: GPTBot’s ALLOW share (5.84 percent) now exceeds its DISALLOW share (4.71 percent) for the first time in tracking, per Digital Applied’s 2026 crawler statistics, which put the DISALLOW figure lower than Technology Checker’s 5.52 percent. The web is leaning toward letting AI in, because the visibility cost of blocking became measurable. For any business whose growth depends on being found, the base rate is clear: your competitors are mostly not blocking, and the ones who are have often done it by accident.

What does blocking AI crawlers actually cost you?

Blocking search and user-action bots costs you AI answer visibility outright, while blocking training bots costs you future model presence, and the two costs land on different timelines. Block OAI-SearchBot and your pages drop out of ChatGPT search retrieval; the citation pipeline described in how to get cited by ChatGPT stops at stage zero for your domain. Block Perplexity’s bots and you disappear from an engine whose users convert at the rates covered in why AI search traffic converts better. These losses are immediate and near-total for the affected engine.

Training-bot blocking is a slower bleed: your brand fades from the knowledge future models answer from, unlinked mentions included. The counterweight is crawl cost. GPTBot crawls roughly 1,255 pages for every referral visit it sends back, and ClaudeBot runs near 20,583:1, per Digital Applied. For most sites this bandwidth is negligible; for large media archives it is a real infrastructure bill with no compensating traffic. That asymmetry, not principle, is why the publisher blocking rate and the business blocking rate diverge so sharply.
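
To see when that crawl cost stops being negligible, the ratios can be turned into bandwidth per referral. This is a rough Python sketch: the two ratios come from the paragraph above, while the 100 KB average page transfer is a hypothetical figure you would replace with your own, and `crawl_mb_per_referral` is an illustrative helper name.

```python
# Pages crawled per referral visit, as quoted above.
CRAWL_TO_REFERRAL = {"GPTBot": 1255, "ClaudeBot": 20583}


def crawl_mb_per_referral(bot: str, avg_page_kb: float) -> float:
    """Estimate megabytes served to a crawler for each referral visit it sends back."""
    return CRAWL_TO_REFERRAL[bot] * avg_page_kb / 1000  # decimal units: 1 MB = 1000 KB


# ASSUMPTION: 100 KB transferred per crawled page (hypothetical; measure your own).
ASSUMED_PAGE_KB = 100

for bot in CRAWL_TO_REFERRAL:
    mb = crawl_mb_per_referral(bot, ASSUMED_PAGE_KB)
    print(f"{bot}: {mb:,.1f} MB served per referral visit")
```

Under that assumed page size, GPTBot works out to about 125.5 MB per referral visit and ClaudeBot to about 2 GB, which is trivial for a brochure site and material for an archive with millions of URLs.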

Which AI crawlers should you allow, engine by engine?

Allow every search and user-action bot, then make one deliberate choice on training bots. The allow list that preserves full AI search visibility: Bingbot (gates ChatGPT and Copilot retrieval), OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-SearchBot, and Google-Extended if you want Gemini grounding. The training list you decide on: GPTBot, ClaudeBot, CCBot, Meta-ExternalAgent, Bytespider.

The decision rule for training bots is a licensing question. If your content is the product and you have negotiating power or an active licensing conversation, blocking training bots preserves that position; that is the publisher play. If your content exists to bring you customers, allow training bots too, because presence in model weights is the durable layer of AI visibility, the one covered in how to get cited by Claude and how to rank in Perplexity. The middle path some brands run: allow training bots on the pages that define the brand (about, services, key explainers) and block them on premium content archives. Robots.txt supports path-level rules; use them.
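
The middle path can be checked before it ships. The sketch below uses Python's standard `urllib.robotparser` to confirm that a path-level policy does what you intend. The `/archive/` and `/research/premium/` paths and the `example.com` domain are hypothetical, and Python's parser only approximates how each vendor's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: training bots may read brand pages but not premium archives.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /archive/
Disallow: /research/premium/

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for path in ("/about", "/services/", "/archive/2019/story"):
    allowed = parser.can_fetch("GPTBot", "https://example.com" + path)
    print(f"GPTBot {path}: {'allowed' if allowed else 'blocked'}")
```

Because only the training bots are named in the restricted group, search and user-action bots fall through to the `*` group and keep full access.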

How do sites block AI crawlers by accident?

Sites block AI crawlers by accident at the CDN layer, where bot protection defaults treat AI agents as threats, and the site owner never sees it. Roughly 27 percent of B2B SaaS and ecommerce sites unknowingly block major LLM crawlers through CDN-level rules, per Digital Applied’s analysis. Cloudflare’s AI-bot blocking toggle, WAF rules, and rate limiters all sit in front of your server, so they are enforced regardless of what robots.txt says; your robots file can say allow while your firewall returns 403s.

The audit takes an hour. Pull your server logs or CDN analytics and search for the major AI user agents: GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Bingbot. Zero hits from a bot that should be crawling you is the accidental-block signature; so are 403 or 429 response patterns. Check your Cloudflare dashboard specifically for the “Block AI bots” toggle and any Super Bot Fight Mode settings, because both ship with defaults that treat search-index crawlers and training crawlers identically. Managed WordPress hosts and security plugins add another layer: several popular firewall plugins added AI-bot blocklists in 2025 updates, enabled without notice on existing installs. Then verify from outside: fetch your key pages with the relevant user-agent strings and confirm 200s. If ChatGPT can quote your homepage when asked, retrieval is working; if it claims the site is inaccessible, start with the firewall, not the content. This diagnostic sequence is the first step in why your website is not showing in AI search.
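
The log-search step of that audit can be sketched in a few lines of Python. This assumes access logs in the common "combined" format with the user agent as the last quoted field; the regular expression, the `audit` helper, and the sample log lines (documentation IP addresses, simplified user agents) are all illustrative and will need adjusting to your own log layout.

```python
import re
from collections import Counter, defaultdict

# Bots named in the audit above.
AUDIT_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Bingbot"]

# Combined log format tail: "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(r'"\S+ \S+ \S+" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')


def audit(log_lines):
    """Return {bot: (total_hits, blocked_hits, verdict)} for each audited bot."""
    hits = defaultdict(Counter)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ua = match.group("ua").lower()
        for bot in AUDIT_BOTS:
            if bot.lower() in ua:
                hits[bot][match.group("status")] += 1
    report = {}
    for bot in AUDIT_BOTS:
        statuses = hits[bot]
        total = sum(statuses.values())
        blocked = statuses["403"] + statuses["429"]
        if total == 0:
            verdict = "no hits"  # accidental-block signature
        elif blocked:
            verdict = "blocked responses"
        else:
            verdict = "ok"
        report[bot] = (total, blocked, verdict)
    return report


# Hypothetical sample lines for illustration.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Mar/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [01/Mar/2026:10:01:00 +0000] "GET /pricing HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.7 - - [01/Mar/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
]

for bot, (total, blocked, verdict) in audit(SAMPLE_LOG).items():
    print(f"{bot}: hits={total} blocked={blocked} -> {verdict}")
```

A "no hits" verdict is only meaningful over a long enough window, so run it against several weeks of logs rather than a single day.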

What should your robots.txt look like in 2026?

Your robots.txt should name each AI bot explicitly with a deliberate allow or disallow, because implicit policy is how accidents persist. A pattern that fits most businesses: explicit Allow blocks for OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-SearchBot, and Bingbot; a considered decision line for GPTBot, ClaudeBot, and CCBot; and Disallow for scrapers you get no value from, like Bytespider.
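
One way to keep that policy explicit is to hold it as data and render the file from it. The Python sketch below does that and then re-parses the output with the standard `urllib.robotparser` to confirm every bot gets the intended answer. The training-bot values are placeholders for your own decision, and `POLICY`, `render_robots`, and `verify` are illustrative names rather than any standard tooling.

```python
from urllib.robotparser import RobotFileParser

# True = allow, False = disallow. Training-bot values are placeholders for your own decision.
POLICY = {
    "OAI-SearchBot": True,
    "ChatGPT-User": True,
    "PerplexityBot": True,
    "Perplexity-User": True,
    "Claude-SearchBot": True,
    "Bingbot": True,
    "GPTBot": True,      # training: decide deliberately
    "ClaudeBot": True,   # training: decide deliberately
    "CCBot": True,       # training: decide deliberately
    "Bytespider": False,
}


def render_robots(policy: dict[str, bool]) -> str:
    """Render one explicit User-agent group per bot, separated by blank lines."""
    groups = []
    for bot, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {bot}\n{rule}\n")
    return "\n".join(groups)


def verify(robots_txt: str, policy: dict[str, bool], url: str = "https://example.com/") -> list[str]:
    """Return the bots whose parsed permission differs from the intended policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot, allowed in policy.items() if parser.can_fetch(bot, url) != allowed]


robots_txt = render_robots(POLICY)
print(robots_txt)
print("mismatches:", verify(robots_txt, POLICY))
```

Generating the file from a single table makes the quarterly review a one-line change per bot and leaves no implicit defaults behind.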

Three maintenance rules keep it honest. First, review quarterly: new bots appear and engines split or rename agents, as OpenAI and Anthropic both did. Second, keep robots.txt and CDN rules in sync; the file is policy, the firewall is enforcement, and drift between them is silent. Third, do not confuse robots.txt with llms.txt, which is a content-guidance proposal rather than an access control, as we covered in what is llms.txt. Access decisions live in robots.txt and your CDN. Get those two aligned and your AI visibility rests on content quality, where it belongs, per the 2026 GEO checklist.
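
The second rule, keeping policy and enforcement in sync, lends itself to a simple drift check. The sketch below is a minimal Python illustration, assuming you already have a per-bot policy and the HTTP status codes each bot received according to your logs, with requests for robots.txt itself excluded. The `find_drift` helper and the sample data are hypothetical.

```python
# Status codes treated as a firewall or rate-limiter refusal.
BLOCK_STATUSES = {403, 429}


def find_drift(policy: dict[str, bool], observed: dict[str, list[int]]) -> dict[str, str]:
    """Compare robots.txt intent (policy) with status codes seen in logs (observed)."""
    drift = {}
    for bot, allowed in policy.items():
        statuses = observed.get(bot, [])
        if allowed and any(status in BLOCK_STATUSES for status in statuses):
            drift[bot] = "robots.txt allows, firewall is refusing requests"
        elif allowed and not statuses:
            drift[bot] = "robots.txt allows, but no requests logged"
        elif not allowed and any(status == 200 for status in statuses):
            drift[bot] = "robots.txt disallows, but pages are still being served"
    return drift


# Hypothetical inputs for illustration.
policy = {"OAI-SearchBot": True, "PerplexityBot": True, "GPTBot": True, "Bytespider": False}
observed = {"OAI-SearchBot": [403, 403], "GPTBot": [200, 200], "Bytespider": [200]}

for bot, problem in find_drift(policy, observed).items():
    print(f"{bot}: {problem}")
```

Running a check like this after every CDN or security-plugin change surfaces drift while it is still a configuration fix rather than a visibility loss.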

Frequently asked questions

Does blocking GPTBot remove my site from ChatGPT?

No. GPTBot is the training crawler. ChatGPT search retrieval runs through OAI-SearchBot and Bing’s index, so a GPTBot block leaves live search citations intact. Blocking OAI-SearchBot or Bingbot is what removes you from ChatGPT answers.

Do AI crawlers respect robots.txt?

Crawlers run by the major operators (OpenAI, Anthropic, Google, Microsoft) honor robots.txt directives. Some scrapers do not, which is why CDN-level enforcement exists. Match your firewall rules to your robots.txt policy rather than choosing one or the other.

Should a small business block anything?

Usually just the no-value scrapers. A small business has everything to gain from AI visibility and no licensing revenue to protect, so a deliberate allow-everything posture wins. Spend the effort on content structure instead, per how to optimize content for AI.

Can I block AI crawlers from some pages but not others?

Yes. Robots.txt supports path-level Disallow rules per user agent, so you can expose service pages and explainers while walling off archives, member content, or premium research. This is the standard middle path for content-heavy brands.

How do I know if my block or allow is working?

Server logs show crawler hits; Bing Webmaster Tools shows index status; and live tests (asking each engine about your pages) show the end result. Recheck after any CDN or security change, because that is when silent blocks appear.

Want to know whether AI engines can actually reach and cite your site today? Request a free analysis and we will run the crawler audit engine by engine.
