
robots.txt Setup for AI Crawlers

🎯 Quick Summary

  • Step-by-step guide to creating and deploying robots.txt for AI crawler control
  • Copy-paste templates for common scenarios (open, selective, protective)
  • Learn robots.txt syntax, common mistakes, and testing procedures
  • Implement dynamic rules based on your content strategy

📋 Table of Contents

  1. robots.txt Basics
  2. AI-Specific Syntax
  3. Template Library
  4. Implementation Steps
  5. Testing & Validation
  6. Troubleshooting

🔑 Key Concepts at a Glance

  • robots.txt: Text file at domain root controlling crawler access
  • User-agent: Specific crawler identifier
  • Allow: Permit access to path
  • Disallow: Block access to path
  • Wildcard (*): Match multiple crawlers or paths

🏷️ Metadata

Tags: robots-txt, technical, implementation, crawler-control
Status: %%ACTIVE%%
Complexity: %%MODERATE%%
Max Lines: 400 (this file: 395 lines)
Reading Time: 9 minutes
Last Updated: 2025-01-18


robots.txt Basics

File Location & Format

URL: https://yoursite.com/robots.txt

Requirements:

  • Must be at root domain (not subdirectory)
  • Must be named exactly robots.txt (lowercase)
  • Must be plain text (UTF-8 encoding)
  • Must be publicly accessible (no authentication)

Basic Structure:

User-agent: [crawler name]
Disallow: [path to block]
Allow: [path to allow]

[Blank line between groups]

User-agent: [another crawler]
Disallow: [path]

Syntax Rules

Case-Sensitivity:

Per RFC 9309, user-agent values match case-insensitively, so compliant parsers treat GPTBot, gptbot, and GPTBOT the same. Still, copy the exact name each vendor documents — simpler crawlers may compare strictly:

User-agent: GPTBot     ✓ Vendor-documented spelling (safest)
User-agent: gptbot     ~ Honored by compliant parsers, but avoid

Path values, by contrast, ARE case-sensitive: Disallow: /Admin does not block /admin.

Path Matching:

Disallow: /admin       → Blocks /admin, /admin/, /admin/page (prefix match)
Disallow: /admin/      → Blocks /admin/ and everything below it (not /admin itself)
Disallow: /admin.html  → Blocks /admin.html (and any longer path starting with it)
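
These prefix rules can be sanity-checked with Python's standard-library urllib.robotparser (note it implements classic prefix matching only, not the * and $ wildcard extensions):

```python
# Sanity-check prefix matching with Python's built-in robots.txt parser.
from urllib import robotparser

robots_txt = """\
User-agent: GPTBot
Disallow: /admin
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "/admin/page"))  # False: blocked by the /admin prefix
print(rp.can_fetch("GPTBot", "/blog/post"))   # True: no rule matches
```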

Wildcards:

User-agent: *          → All crawlers
Disallow: /*.pdf$ → All PDF files
Allow: /blog/*/public → blog/any-folder/public
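
The wildcard semantics can be sketched by translating a rule into a regular expression — an illustration of Google-style * and $ handling, not the official parser:

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule with * and $ into a regex
    (a sketch of Google-style wildcard semantics)."""
    pattern = re.escape(rule).replace(r"\*", ".*")  # * matches any run of chars
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"                # trailing $ anchors the end
    return re.compile(pattern)

print(bool(rule_to_regex("/*.pdf$").match("/files/report.pdf")))      # True
print(bool(rule_to_regex("/*.pdf$").match("/files/report.pdf?v=2")))  # False: $ anchors
```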

AI-Specific Syntax

User-Agent Identifiers

Exact names (case-sensitive):

# OpenAI ChatGPT
User-agent: GPTBot

# Anthropic Claude (ClaudeBot is the current crawler; Claude-Web is an older token)
User-agent: Claude-Web

# Google Gemini/Bard
User-agent: Google-Extended

# Perplexity
User-agent: PerplexityBot

# Common Crawl
User-agent: CCBot

# Meta (FacebookBot; Meta-ExternalAgent is also used for AI training)
User-agent: FacebookBot

# Apple Intelligence
User-agent: Applebot-Extended

Allow vs Disallow

Disallow (Block):

User-agent: GPTBot
Disallow: / # Block everything

Allow (Permit):

User-agent: GPTBot
Allow: / # Allow everything

Mixed (Selective):

User-agent: GPTBot
Disallow: / # Block everything first
Allow: /blog/ # Then allow specific sections
Allow: /guides/

Specificity Matters

The longest matching rule wins, regardless of the order rules appear in:

User-agent: GPTBot
Disallow: /admin/
Allow: /admin/public/

Result: /admin/ blocked EXCEPT /admin/public/
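
The evaluation logic above can be sketched in a few lines (RFC 9309 longest-match semantics; on a tie in length, Allow wins):

```python
def decide(path: str, rules) -> str:
    """Pick the verdict of the longest matching rule.
    rules: list of (directive, path_prefix) pairs."""
    best_len, verdict = -1, "allow"  # no match at all => allowed by default
    for directive, prefix in rules:
        if path.startswith(prefix):
            longer = len(prefix) > best_len
            tie_allow = len(prefix) == best_len and directive == "allow"
            if longer or tie_allow:
                best_len, verdict = len(prefix), directive
    return verdict

rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(decide("/admin/secret", rules))      # disallow
print(decide("/admin/public/doc", rules))  # allow: longer rule wins
```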

Template Library

Template 1: Open Access (Default)

Use when: SaaS, public content, maximum visibility desired

# robots.txt - Open Access Strategy

# Search Engines (always allow)
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# AI Crawlers - Allow All
User-agent: GPTBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

# Protect only admin/internal
User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Disallow: /*.json$
Disallow: /*?* # Block query parameters

# Sitemaps
Sitemap: https://yoursite.com/sitemap.xml

Template 2: Selective Access

Use when: Balance visibility and protection

# robots.txt - Selective Strategy

# Search Engines (always allow - the catch-all block below
# would otherwise de-index the site)
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Allow major AI platforms only
User-agent: GPTBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

# Block aggregators & unknowns
User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /

# Sitemaps
Sitemap: https://yoursite.com/sitemap.xml

Template 3: Content Protection

Use when: Premium content, competitive concerns

# robots.txt - Protective Strategy

# Block all AI from premium content
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Disallow: /subscriber/
Allow: /

User-agent: Claude-Web
Disallow: /premium/
Disallow: /members/
Allow: /

User-agent: Google-Extended
Disallow: /premium/
Allow: /

# Block Common Crawl entirely
User-agent: CCBot
Disallow: /

# Free content (/blog/, /free-resources/, /guides/) stays open via the
# Allow: / line in each group above. Allow/Disallow lines without a
# preceding User-agent line would be ignored.

# Sitemaps (public content only)
Sitemap: https://yoursite.com/sitemap-public.xml

Template 4: News Publisher

Use when: Time-sensitive content, protect recent articles

# robots.txt - News Publisher

# Protect current year content
User-agent: GPTBot
Disallow: /2025/
Allow: /

User-agent: Claude-Web
Disallow: /2025/
Allow: /

User-agent: Google-Extended
Disallow: /2025/
Disallow: /2024/
Allow: /

# Block Perplexity (competes for traffic)
User-agent: PerplexityBot
Disallow: /

# Archive years stay open via the Allow: / line in each group above.
# Allow/Disallow lines without a preceding User-agent line would be ignored.

# Sitemaps
Sitemap: https://yoursite.com/sitemap-current.xml
Sitemap: https://yoursite.com/sitemap-archive.xml

Template 5: Documentation Site

Use when: Developer docs, maximum discoverability needed

# robots.txt - Documentation Site

# Allow ALL AI crawlers (maximum visibility)
User-agent: *
Allow: /

# Explicit permissions for major platforms
User-agent: GPTBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

# Comprehensive sitemaps
Sitemap: https://docs.yoursite.com/sitemap.xml
Sitemap: https://docs.yoursite.com/api-reference-sitemap.xml
Sitemap: https://docs.yoursite.com/guides-sitemap.xml

Template 6: E-commerce

Use when: Product catalog, want AI shopping assistant visibility

# robots.txt - E-commerce

# Allow AI crawlers for product discovery
User-agent: GPTBot
Allow: /products/
Allow: /categories/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/

User-agent: Google-Extended
Allow: /products/
Allow: /categories/
Disallow: /cart/
Disallow: /account/

User-agent: PerplexityBot
Allow: /products/
Disallow: /cart/
Disallow: /account/

# Block from customer data
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /admin/
Disallow: /orders/

# Product sitemaps
Sitemap: https://yoursite.com/sitemap-products.xml
Sitemap: https://yoursite.com/sitemap-categories.xml
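
All six templates share the same group-plus-sitemap shape, so one way to keep a policy maintainable is rendering robots.txt from a single table — a sketch using agent names from the templates above:

```python
def render_robots(policy: dict, sitemaps: list) -> str:
    """Render a robots.txt from {agent: [(directive, path), ...]}."""
    lines = []
    for agent, rules in policy.items():
        lines.append(f"User-agent: {agent}")
        for directive, path in rules:
            lines.append(f"{directive}: {path}")
        lines.append("")  # blank line separates groups
    lines += [f"Sitemap: {url}" for url in sitemaps]
    return "\n".join(lines) + "\n"

policy = {
    "GPTBot": [("Allow", "/")],
    "CCBot": [("Disallow", "/")],
    "*": [("Disallow", "/admin/")],
}
print(render_robots(policy, ["https://yoursite.com/sitemap.xml"]))
```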

Implementation Steps

Step 1: Create robots.txt File

Local creation:

# Create file
nano robots.txt

# Or use text editor
# Save as: robots.txt (UTF-8, no BOM)

Example content:

User-agent: GPTBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: CCBot
Disallow: /

Sitemap: https://yoursite.com/sitemap.xml

Step 2: Upload to Root Directory

Via FTP/SFTP:

Upload to: /public_html/robots.txt
Or: /var/www/html/robots.txt
Or: /htdocs/robots.txt

Final URL: https://yoursite.com/robots.txt

Via cPanel:

1. Login to cPanel
2. File Manager
3. Navigate to public_html/
4. Upload robots.txt
5. Set permissions: 644

Via Git:

# Add to repository root
git add robots.txt
git commit -m "Add AI crawler controls"
git push origin main

# Deploy (depends on your setup)

Step 3: Verify Accessibility

Test URL directly:

Visit: https://yoursite.com/robots.txt

Should see plain text content (not 404, not redirect)

cURL test:

curl -I https://yoursite.com/robots.txt

# Should return:
HTTP/1.1 200 OK
Content-Type: text/plain
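
The same check can be scripted — a minimal Python sketch (the live network call is commented out because yoursite.com is a placeholder):

```python
def robots_headers_ok(status: int, content_type: str) -> bool:
    """A correctly deployed robots.txt answers 200 with text/plain."""
    return status == 200 and content_type.split(";")[0].strip() == "text/plain"

# Live check (requires network; yoursite.com is a placeholder):
# from urllib.request import urlopen
# with urlopen("https://yoursite.com/robots.txt") as resp:
#     print(robots_headers_ok(resp.status, resp.headers.get("Content-Type", "")))

print(robots_headers_ok(200, "text/plain; charset=utf-8"))  # True
print(robots_headers_ok(404, "text/html"))                  # False
```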

Step 4: Monitor Impact

Week 1: Verify blocking

# Check server logs
grep "GPTBot" /var/log/access.log

# A compliant crawler that reads your Disallow rules simply stops
# requesting those paths, so blocked bots should disappear from the log.

# If you also enforce the block server-side:
GPTBot ... 403 Forbidden

# If still allowed (or the bot ignores robots.txt):
GPTBot ... 200 OK
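
A small script can tally crawler hits per status code from an access log — a sketch assuming combined log format; adjust the bot list and field parsing to your server:

```python
from collections import Counter

AI_BOTS = ("GPTBot", "Claude-Web", "CCBot", "PerplexityBot")

def tally(log_lines):
    """Count (bot, status) pairs from combined-format access log lines."""
    counts = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                # Status code is the first field after the quoted request.
                status = line.split('"')[2].split()[0]
                counts[(bot, status)] += 1
    return counts

sample = [
    '1.2.3.4 - - [18/Jan/2025:10:00:00 +0000] "GET /blog HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '1.2.3.4 - - [18/Jan/2025:10:01:00 +0000] "GET /premium HTTP/1.1" 403 0 "-" "GPTBot/1.0"',
]
print(tally(sample)[("GPTBot", "403")])  # 1
```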

Month 1: Measure citation changes

Before blocking CCBot:
- Citation Rate: 28%

After blocking CCBot:
- Citation Rate: 27% (-1pp)
- CCBot visits: 0

Conclusion: Minimal impact, keep blocked

Testing & Validation

robots.txt Tester Tools

1. Google Search Console

URL: https://search.google.com/search-console

Note: the legacy robots.txt Tester has been retired; use the robots.txt report instead.

Steps:
1. Add property (yoursite.com)
2. Go to: Settings → Crawling → robots.txt
3. Confirm the file shows as fetched
4. Review any flagged parse errors

Result:
✓ Fetched, no errors
✗ Not fetched, or errors flagged

2. Online robots.txt Validators

- https://www.robotstxt.org/validator/
- https://technicalseo.com/tools/robots-txt/
- https://en.ryte.com/free-tools/robots-txt-validator/

Upload or paste your robots.txt
Get syntax error detection

3. Manual Testing

# Test if crawler can access
curl -A "GPTBot/1.0" https://yoursite.com/blog

# Should return:
200 OK (if allowed)
403 Forbidden (if blocked by server)

# Note: robots.txt is advisory
# Server must enforce blocks separately

Common Validation Errors

Error 1: Wrong location

❌ https://yoursite.com/assets/robots.txt
❌ https://yoursite.com/admin/robots.txt
✓ https://yoursite.com/robots.txt

Error 2: Wrong file type

❌ robots.txt.txt
❌ Robots.TXT
❌ robots.html
✓ robots.txt

Error 3: Syntax errors

Directive names are case-insensitive per RFC 9309, so User-Agent: and disallow: both parse — but keep the conventional casing for readability:

✓ User-agent: GPTBot
✓ Disallow: /admin

The actual syntax error is combining directives on one line:

❌ User-agent: GPTBot Allow: / (same line)
✓ User-agent: GPTBot
  Allow: /
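
A minimal linter can catch the structural problems above — a sketch only; production parsers such as Google's open-source robotstxt library are far more forgiving:

```python
def lint_robots(text: str) -> list:
    """Flag structural robots.txt problems: orphaned rules, missing ':',
    unknown directives, and multiple directives jammed onto one line."""
    problems = []
    in_group = False
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # strip comments
        if not line:
            in_group = False                 # blank line ends the group
            continue
        if ":" not in line:
            problems.append(f"line {n}: missing ':' separator")
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()
        if field == "user-agent":
            in_group = True
        elif field in ("allow", "disallow"):
            if not in_group:
                problems.append(f"line {n}: {field} rule outside a User-agent group")
        elif field not in ("sitemap", "crawl-delay"):
            problems.append(f"line {n}: unknown directive '{field}'")
        # e.g. "User-agent: GPTBot Allow: /" on a single line
        if any(d in value.lower() for d in ("allow:", "disallow:", "user-agent:")):
            problems.append(f"line {n}: multiple directives on one line")
    return problems

print(lint_robots("User-agent: GPTBot Allow: /\n"))  # ['line 1: multiple directives on one line']
print(lint_robots("Allow: /blog/\n"))                # ['line 1: allow rule outside a User-agent group']
```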

Troubleshooting

Issue 1: Crawler Still Accessing Blocked Content

Possible causes:

1. robots.txt cached by crawler
→ Wait 24-48 hours for re-crawl

2. Syntax error in robots.txt
→ Validate with online tool

3. robots.txt not at root
→ Move to https://yoursite.com/robots.txt

4. Server not enforcing (robots.txt is advisory)
→ Implement server-level blocks

Server-level enforcement (Apache):

# .htaccess
RewriteEngine On

# Block GPTBot from /premium/
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteCond %{REQUEST_URI} ^/premium/ [NC]
RewriteRule .* - [F,L]

Issue 2: Search Engine Crawler Blocked Unintentionally

Symptom: Google can't index your site

Cause: Blocked Googlebot in robots.txt

Fix:

# Was:
User-agent: *
Disallow: /

# Should be:
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *          # Everything else (AI crawlers, unknown bots)
Disallow: /

Issue 3: No Impact on Citations

Symptom: Blocked crawler but still getting cited

Explanation: Model already trained on your content

Solution:

robots.txt blocks future training, not current knowledge

To reduce existing citations:
1. Wait for next model training cycle (6-12 months)
2. Remove/update content
3. Request removal from AI company

