Technical AEO

The robots.txt Mistake That's Blocking AI Crawlers from Your Site

By GetCitedBy

The robots.txt file is a plain text file at the root of your website that tells web crawlers which pages they can and cannot access. It has been a foundational part of how search engines interact with websites since 1994. Now it plays an equally critical role in determining whether AI platforms can access your content — and therefore whether they can cite your brand in their responses.

Many companies are accidentally blocking AI crawlers without realizing it. One misconfigured line in your robots.txt could be the reason ChatGPT, Claude, Gemini, or Perplexity has never seen your content and doesn’t know your brand exists.

Which AI crawlers exist and what do they do?

Each major AI platform has its own web crawler, similar to how Google has Googlebot. Here are the ones that matter:

GPTBot (User-agent: GPTBot) — OpenAI’s web crawler. It feeds content into ChatGPT’s retrieval system and may contribute to training data for future models. Allowing GPTBot is how you become visible in ChatGPT’s browsing-enabled responses.

OAI-SearchBot (User-agent: OAI-SearchBot) — OpenAI’s dedicated search crawler, separate from GPTBot. This one specifically powers ChatGPT’s search functionality. If you block GPTBot but allow OAI-SearchBot, you’ll appear in search results but may miss broader ChatGPT citations.

ClaudeBot (User-agent: ClaudeBot) — Anthropic’s web crawler for Claude. It collects content that Claude can reference when users ask questions that benefit from current web information.

PerplexityBot (User-agent: PerplexityBot) — Perplexity’s crawler. Since Perplexity is fundamentally a search product that cites its sources with direct links, being accessible to PerplexityBot often results in the most immediately visible citations.

Google-Extended (User-agent: Google-Extended) — Google's AI-specific control. It's technically a robots.txt product token rather than a crawler with its own user agent: Googlebot does the actual fetching, and the Google-Extended token governs whether that content can be used for Gemini and other AI-generated answers. Blocking Google-Extended keeps your content out of Gemini's answers even while Googlebot continues to crawl your site for traditional search.

Bytespider (User-agent: Bytespider) — ByteDance’s crawler, which powers AI features across their products. Less relevant for most Canadian service businesses but worth knowing about.

CCBot (User-agent: CCBot) — The Common Crawl bot, which builds the open datasets used to train many AI models. Allowing CCBot means your content may be included in training data for a wide range of models beyond the major platforms.

What are the most common blocking patterns?

We’ve reviewed robots.txt files from hundreds of business websites. Here are the patterns that cause the most problems:

The blanket block

User-agent: *
Disallow: /

This blocks every crawler from every page. It’s rare for an entire site, but we see it applied to specific directories that contain valuable content — like /blog/, /resources/, or /case-studies/. If your most citable content lives behind a blanket disallow, no AI platform will ever see it.

The wildcard bot block

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

Some companies added these blocks when AI crawling became a public conversation in 2023-2024. The intent was often to protect content from being used as training data. The unintended consequence is complete AI invisibility — your brand simply doesn’t exist in these platforms’ knowledge.

The inherited security block

Many companies have robots.txt configurations created by IT or security teams that block all non-recognized user agents. These configurations predate AI crawlers and weren’t designed with AI visibility in mind, but they effectively prevent all AI platforms from accessing your content.

The CMS default

Some content management systems and hosting platforms ship with robots.txt configurations that block non-standard bots by default. If you’ve never explicitly reviewed your robots.txt with AI crawlers in mind, your CMS may be making this decision for you.

How should you properly configure robots.txt for AI visibility?

The recommended configuration explicitly allows all major AI crawlers while maintaining any specific blocks you need for other reasons:

User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

A few important notes on this configuration:

Be explicit. Even though User-agent: * with Allow: / technically allows everything, adding explicit entries for each AI crawler is a best practice. Some AI crawlers may not honor the wildcard entry, and explicit entries make your intent clear.
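That "be explicit" advice can be checked mechanically: scan your file's User-agent lines and flag any AI crawler without its own group. A minimal sketch, where the robots.txt content is a deliberately incomplete example:

```python
# Illustrative audit: which AI crawlers lack an explicit group?
WANTED = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended")

# Trimmed example file, intentionally missing some bots.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /
"""

# Collect every user agent that has an explicit group.
declared = {
    line.split(":", 1)[1].strip()
    for line in ROBOTS_TXT.splitlines()
    if line.lower().startswith("user-agent:")
}
missing = [bot for bot in WANTED if bot not in declared]
print("Missing explicit groups:", missing)
# Missing explicit groups: ['OAI-SearchBot', 'PerplexityBot', 'Google-Extended']
```

Point the same logic at your live file (fetched from yourdomain.com/robots.txt) to get an instant gap list.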

Block specific paths, not specific bots. If you have content that genuinely shouldn't be crawled (admin pages, staging environments, private customer areas), block those paths for all agents rather than blocking specific bots from your entire site. One caveat: under the Robots Exclusion Protocol, a crawler obeys only the single group that best matches its user agent, so once you list a bot explicitly (as in the configuration above), repeat any path disallows inside that bot's group; the wildcard group's rules no longer apply to it.

Include your sitemap. The Sitemap directive helps all crawlers — search engine and AI — discover your content efficiently. Make sure your sitemap is current and includes all pages you want to be discoverable.

Add llms.txt. Beyond robots.txt, the llms.txt standard provides a structured overview of your content specifically for AI consumption. Add it to your site root alongside robots.txt. This gives AI systems a clear map of your most important content. We’ve implemented one on our own site — you can see it at getcitedby.tech/llms.txt.

How do you verify that AI crawlers can access your site?

After updating your robots.txt, you should verify that the changes are working as intended.

Check Google's robots.txt report

Google Search Console's robots.txt report shows the robots.txt files Google has found for your site, when each was last crawled, and any parsing errors or warnings. (Google retired its standalone robots.txt Tester tool in 2023.) The report covers Google's crawlers only, so test the other AI user agents against your most important pages with a third-party validator.

Check server logs

If you have access to your server logs, look for requests from AI crawler user agents. After allowing them in robots.txt, you should start seeing requests from GPTBot, ClaudeBot, and PerplexityBot within days to weeks. If you don’t see any requests after a few weeks, the crawlers may not have rediscovered your site yet — submitting your sitemap or creating fresh content can help trigger a recrawl.
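A quick way to do this without special tooling is to count crawler tokens in your access log's User-Agent fields. A sketch with made-up log lines (real log formats vary by server, but the user agent is usually the last quoted field):

```python
from collections import Counter

# Made-up access-log lines for illustration; real formats vary.
LOG_LINES = [
    '203.0.113.7 - - "GET /blog/pricing HTTP/1.1" 200 "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.3 - - "GET /resources HTTP/1.1" 200 "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.9 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (Windows NT 10.0) Chrome/124.0"',
]

AI_BOTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider")

# One hit per (line, bot) pair found anywhere in the line.
hits = Counter(
    bot
    for line in LOG_LINES
    for bot in AI_BOTS
    if bot.lower() in line.lower()
)
print(dict(hits))  # {'GPTBot': 1, 'ClaudeBot': 1}
```

Run the same counter over a few weeks of logs and you get a rough per-bot crawl frequency for free.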

Query the AI platforms directly

The most direct test: ask ChatGPT, Claude, or Perplexity a question that should reference your content, and see if they can find and cite it. Be specific — use queries that reference your brand name or unique content that only exists on your site.

Use online robots.txt validators

Several free tools will parse your robots.txt file and tell you exactly what each user agent can and cannot access. Run your file through one of these after any changes.
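You can also validate locally with Python's built-in urllib.robotparser, which answers the same question a hosted validator does: can this user agent fetch this URL? The file below is a placeholder; note that its path disallow is repeated inside the explicit GPTBot group, because a crawler that matches a specific group ignores the wildcard group entirely.

```python
from urllib import robotparser

# Placeholder robots.txt. The /admin/ block is repeated in the
# GPTBot group because specific groups supersede the wildcard.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /

User-agent: GPTBot
Disallow: /admin/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check each (agent, URL) pair the way a validator would.
for agent, url in [
    ("GPTBot", "https://yourdomain.com/blog/post"),
    ("GPTBot", "https://yourdomain.com/admin/settings"),
    ("PerplexityBot", "https://yourdomain.com/blog/post"),
]:
    print(agent, url, "->", rp.can_fetch(agent, url))
```

One caveat: robotparser checks rules in file order, while RFC 9309 prefers the longest matching rule, so keep Disallow lines above their broader Allow lines (as above) to get identical answers from both.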

What is the llms.txt standard?

The llms.txt specification is an emerging standard that provides AI systems with a structured, human-readable overview of your website’s content. It’s a markdown file placed at your site root (yourdomain.com/llms.txt) that describes your organization, your key content, and how you’d like AI systems to understand and reference your brand.

Think of it as the AI equivalent of a sitemap, but designed for how language models process information rather than how search crawlers index pages. While robots.txt tells crawlers what they can access, llms.txt tells AI systems what your content actually means and how it’s organized.

A companion file, llms-full.txt, provides more comprehensive content for AI platforms that want deeper context. Together, these files give AI systems a clear, structured understanding of your brand.

The standard is still relatively new, but early adoption is an advantage. AI platforms are actively looking for these signals, and having them in place differentiates your site from the majority that don’t.
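Because the format is plain markdown, generating a starting point is straightforward. A sketch that follows the spec's basic shape, an H1 name, a blockquote summary, then H2 sections of links; every name and URL below is a placeholder:

```python
def build_llms_txt(name: str, summary: str,
                   sections: dict[str, list[tuple[str, str]]]) -> str:
    """Assemble a minimal llms.txt: H1 title, blockquote summary,
    then one H2 section per group of (title, url) links."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.extend(f"- [{title}]({url})" for title, url in links)
        lines.append("")
    return "\n".join(lines)

print(build_llms_txt(
    "Example Co",
    "Plumbing services and plain-language guides for homeowners.",
    {
        "Guides": [("Choosing a water heater", "https://yourdomain.com/guides/water-heaters")],
        "About": [("Company overview", "https://yourdomain.com/about")],
    },
))
```

Save the output at your site root as llms.txt, then review it by hand; the curation (which pages you list, and how you summarize them) matters more than the generation.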

What about the tradeoff between visibility and content protection?

This is the most common concern we hear from business owners considering their AI crawler policy. The concern is legitimate: allowing AI crawlers means your content may be used in model training, and you don’t get direct compensation for that use.

Here’s our perspective after working with hundreds of businesses:

For most service businesses, visibility is far more valuable than content protection. Your content’s purpose is to attract and convert buyers. If AI platforms can cite your content and recommend your brand, that’s a powerful distribution channel. The cost of being invisible — lost awareness, lost consideration, lost deals — almost always exceeds the theoretical value of content exclusivity.

The exception is proprietary research or data. If your business model depends on selling access to proprietary information, restricting AI crawler access to that specific content makes sense. But you can do this surgically — block specific paths while keeping your marketing content accessible.

The market is moving toward citation, not just training. Platforms like Perplexity already cite sources with links. ChatGPT’s browsing mode attributes content. The trend is toward AI systems that reference and credit their sources, which means allowing access increasingly comes with attribution benefits.

For most business owners, the right answer is to allow AI crawlers, implement llms.txt for maximum discoverability, and focus your energy on ensuring the content they find is accurate, well-structured, and positions your brand authoritatively. If you want help auditing your current setup, request a free audit.

Frequently Asked Questions

Will allowing AI crawlers hurt my SEO?

No. AI crawlers are completely separate from search engine crawlers like Googlebot. Allowing GPTBot or ClaudeBot has no impact on your Google rankings. In fact, being cited by AI platforms can create a positive halo effect that improves your traditional SEO metrics through increased brand search volume and click-through rates.

How often do AI crawlers visit my site?

It varies by platform and by the perceived importance of your site. High-authority sites with frequently updated content get crawled more often. Most sites that allow AI crawlers see visits ranging from daily to weekly. You can check your server logs to see the actual crawl frequency for each bot.

Do I need to allow all AI crawlers or can I be selective?

You can be selective. Each AI crawler has its own user agent, so you can allow some while blocking others. However, we recommend allowing all major crawlers because each platform has a different user base, and you want your brand visible wherever your buyers are asking questions.

What if I previously blocked AI crawlers — how long until I start appearing in responses?

After unblocking, there’s a lag. Perplexity (which does real-time search) may pick up your content within days. ChatGPT and Claude typically take longer — weeks to months — as their systems recrawl and reindex your site. Creating fresh, high-quality content after unblocking can accelerate the process by giving crawlers a reason to visit.


Is your brand visible to AI?

Get a free score showing how ChatGPT, Claude, Gemini, and Perplexity see your brand today.

Get Your Free AI Visibility Score