robots.txt Setup for AI Crawlers
🎯 Quick Summary
- Step-by-step guide to creating and deploying robots.txt for AI crawler control
- Copy-paste templates for common scenarios (open, selective, protective)
- Learn robots.txt syntax, common mistakes, and testing procedures
- Implement dynamic rules based on your content strategy
📋 Table of Contents
- robots.txt Basics
- AI-Specific Syntax
- Template Library
- Implementation Steps
- Testing & Validation
- Troubleshooting
🔑 Key Concepts at a Glance
- robots.txt: Text file at domain root controlling crawler access
- User-agent: Specific crawler identifier
- Allow: Permit access to path
- Disallow: Block access to path
- Wildcard (*): Match multiple crawlers or paths
🏷️ Metadata
Tags: robots-txt, technical, implementation, crawler-control
Status: %%ACTIVE%%
Complexity: %%MODERATE%%
Max Lines: 400 (this file: 395 lines)
Reading Time: 9 minutes
Last Updated: 2025-01-18
robots.txt Basics
File Location & Format
URL: https://yoursite.com/robots.txt
Requirements:
- Must be at root domain (not subdirectory)
- Must be named exactly robots.txt (lowercase)
- Must be plain text (UTF-8 encoding)
- Must be publicly accessible (no authentication)
Basic Structure:
User-agent: [crawler name]
Disallow: [path to block]
Allow: [path to allow]
[Blank line between groups]
User-agent: [another crawler]
Disallow: [path]
Syntax Rules
Case handling:
User-agent: GPTBot ✓ Preferred (the documented token)
User-agent: gptbot ⚠ Usually matched anyway; RFC 9309 makes user-agent matching case-insensitive, but not every parser complies, so copy the documented capitalization
Path Matching:
Disallow: /admin → Blocks /admin, /admin/, /admin/page
Disallow: /admin/ → Blocks /admin/ and subdirectories
Disallow: /admin.html → Blocks only /admin.html
Wildcards:
User-agent: * → All crawlers
Disallow: /*.pdf$ → All PDF files ($ anchors the end of the URL)
Allow: /blog/*/public → /blog/any-folder/public
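These prefix rules can be sanity-checked locally with Python's standard-library parser (a sketch; note that urllib.robotparser handles plain prefix matching but not the * and $ wildcard extensions):

```python
from urllib.robotparser import RobotFileParser

# Prefix rule from the examples above: /admin, /admin/, and
# /admin/page should all be blocked for GPTBot
rules = """\
User-agent: GPTBot
Disallow: /admin
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://yoursite.com/admin/page"))     # False
print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/post"))      # True
# Crawlers with no matching group (and no * group) default to allowed
print(rp.can_fetch("Googlebot", "https://yoursite.com/admin/page"))  # True
```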
AI-Specific Syntax
User-Agent Identifiers
Exact names (use the documented capitalization):
# OpenAI ChatGPT
User-agent: GPTBot
# Anthropic Claude (Anthropic also crawls as ClaudeBot; check their docs for current tokens)
User-agent: Claude-Web
# Google Gemini/Bard
User-agent: Google-Extended
# Perplexity
User-agent: PerplexityBot
# Common Crawl
User-agent: CCBot
# Meta (Meta also publishes other crawler tokens for AI training; verify their current documentation)
User-agent: FacebookBot
# Apple Intelligence
User-agent: Applebot-Extended
Allow vs Disallow
Disallow (Block):
User-agent: GPTBot
Disallow: / # Block everything
Allow (Permit):
User-agent: GPTBot
Allow: / # Allow everything
Mixed (Selective):
User-agent: GPTBot
Disallow: / # Block everything first
Allow: /blog/ # Then allow specific sections
Allow: /guides/
Specificity Matters
The most specific (longest) matching rule wins, regardless of the order rules appear in:
User-agent: GPTBot
Disallow: /admin/
Allow: /admin/public/
Result: /admin/ blocked EXCEPT /admin/public/
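The precedence rule above can be sketched in a few lines of Python (RFC 9309 semantics with wildcards ignored; decide is a hypothetical helper, not part of any library):

```python
def decide(path, rules):
    """Return True if path is allowed under RFC 9309 precedence:
    the longest matching prefix wins, Allow wins ties, and a path
    matched by no rule is allowed."""
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed

rules = [("Disallow", "/admin/"), ("Allow", "/admin/public/")]
print(decide("/admin/secret", rules))       # False: /admin/ is the longest match
print(decide("/admin/public/page", rules))  # True: /admin/public/ is longer
print(decide("/blog/post", rules))          # True: no rule matches
```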
Template Library
Template 1: Open Access (Default)
Use when: SaaS, public content, maximum visibility desired
# robots.txt - Open Access Strategy
# Search Engines (always allow)
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# AI Crawlers - Allow All
User-agent: GPTBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
# Protect admin/internal (note: a crawler matched by a named group
# above ignores this * group; repeat these Disallows inside each
# named group if they should apply there too)
User-agent: *
Disallow: /admin/
Disallow: /api/internal/
Disallow: /*.json$
Disallow: /*?* # Block URLs with query strings
# Sitemaps
Sitemap: https://yoursite.com/sitemap.xml
Template 2: Selective Access
Use when: Balance visibility and protection
# robots.txt - Selective Strategy
# Search engines (explicit allow is required here, because the
# catch-all * group below blocks everything else)
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
# Allow major AI platforms only
User-agent: GPTBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Google-Extended
Allow: /
# Block aggregators & unknowns
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: *
Disallow: /
# Sitemaps
Sitemap: https://yoursite.com/sitemap.xml
Template 3: Content Protection
Use when: Premium content, competitive concerns
# robots.txt - Protective Strategy
# Block all AI from premium content
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Disallow: /subscriber/
Allow: /
User-agent: Claude-Web
Disallow: /premium/
Disallow: /members/
Allow: /
User-agent: Google-Extended
Disallow: /premium/
Allow: /
# Block Common Crawl entirely
User-agent: CCBot
Disallow: /
# Keep free content open for everyone else (directives must sit
# inside a User-agent group; without the * line below they would
# attach to the CCBot group above and weaken its block)
User-agent: *
Allow: /blog/
Allow: /free-resources/
Allow: /guides/
# Sitemaps (public content only)
Sitemap: https://yoursite.com/sitemap-public.xml
Template 4: News Publisher
Use when: Time-sensitive content, protect recent articles
# robots.txt - News Publisher
# Protect current year content
User-agent: GPTBot
Disallow: /2025/
Allow: /
User-agent: Claude-Web
Disallow: /2025/
Allow: /
User-agent: Google-Extended
Disallow: /2025/
Disallow: /2024/
Allow: /
# Block Perplexity (competes for traffic)
User-agent: PerplexityBot
Disallow: /
# Allow archive (2023 and older) for all other crawlers
User-agent: *
Allow: /2023/
Allow: /2022/
# Sitemaps
Sitemap: https://yoursite.com/sitemap-current.xml
Sitemap: https://yoursite.com/sitemap-archive.xml
Template 5: Documentation Site
Use when: Developer docs, maximum discoverability needed
# robots.txt - Documentation Site
# Allow ALL AI crawlers (maximum visibility)
User-agent: *
Allow: /
# Explicit permissions for major platforms
User-agent: GPTBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
# Comprehensive sitemaps
Sitemap: https://docs.yoursite.com/sitemap.xml
Sitemap: https://docs.yoursite.com/api-reference-sitemap.xml
Sitemap: https://docs.yoursite.com/guides-sitemap.xml
Template 6: E-commerce
Use when: Product catalog, want AI shopping assistant visibility
# robots.txt - E-commerce
# Allow AI crawlers for product discovery
User-agent: GPTBot
Allow: /products/
Allow: /categories/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
User-agent: Google-Extended
Allow: /products/
Allow: /categories/
Disallow: /cart/
Disallow: /account/
User-agent: PerplexityBot
Allow: /products/
Disallow: /cart/
Disallow: /account/
# Block from customer data
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /admin/
Disallow: /orders/
# Product sitemaps
Sitemap: https://yoursite.com/sitemap-products.xml
Sitemap: https://yoursite.com/sitemap-categories.xml
Implementation Steps
Step 1: Create robots.txt File
Local creation:
# Create file
nano robots.txt
# Or use text editor
# Save as: robots.txt (UTF-8, no BOM)
Example content:
User-agent: GPTBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: CCBot
Disallow: /
Sitemap: https://yoursite.com/sitemap.xml
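Before uploading, the example policy can be checked locally with Python's standard-library parser (a quick sanity check for plain prefix rules, not a full validator):

```python
from urllib.robotparser import RobotFileParser

# The Step 1 example policy (sitemap line omitted; it does not
# affect allow/disallow decisions)
policy = """\
User-agent: GPTBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: CCBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(policy.splitlines())
print(rp.can_fetch("GPTBot", "https://yoursite.com/blog/post"))  # True
print(rp.can_fetch("CCBot", "https://yoursite.com/blog/post"))   # False
```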
Step 2: Upload to Root Directory
Via FTP/SFTP:
Upload to: /public_html/robots.txt
Or: /var/www/html/robots.txt
Or: /htdocs/robots.txt
Final URL: https://yoursite.com/robots.txt
Via cPanel:
1. Login to cPanel
2. File Manager
3. Navigate to public_html/
4. Upload robots.txt
5. Set permissions: 644
Via Git:
# Add to repository root
git add robots.txt
git commit -m "Add AI crawler controls"
git push origin main
# Deploy (depends on your setup)
Step 3: Verify Accessibility
Test URL directly:
Visit: https://yoursite.com/robots.txt
Should see plain text content (not 404, not redirect)
cURL test:
curl -I https://yoursite.com/robots.txt
# Should return:
HTTP/1.1 200 OK
Content-Type: text/plain
Step 4: Monitor Impact
Week 1: Verify behavior
# Check server logs
grep "GPTBot" /var/log/access.log
# A compliant crawler simply stops requesting disallowed paths,
# so blocked entries disappear from the log rather than erroring.
# A 403 appears only if you added server-level enforcement:
GPTBot ... 403 Forbidden
# If allowed:
GPTBot ... 200 OK
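To quantify this over time, a small script can tally hits per crawler (a sketch; the token list and the sample log lines are fabricated for illustration):

```python
from collections import Counter

# AI crawler tokens to tally (extend as needed)
BOTS = ["GPTBot", "Claude-Web", "CCBot", "PerplexityBot"]

def bot_hits(log_lines):
    """Count access-log lines mentioning each crawler token."""
    hits = Counter()
    for line in log_lines:
        for bot in BOTS:
            if bot in line:
                hits[bot] += 1
    return hits

sample = [
    '1.2.3.4 - - [18/Jan/2025] "GET /blog/post HTTP/1.1" 200 "GPTBot/1.0"',
    '5.6.7.8 - - [18/Jan/2025] "GET /premium/x HTTP/1.1" 403 "CCBot/2.0"',
]
print(dict(bot_hits(sample)))  # {'GPTBot': 1, 'CCBot': 1}
```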
Month 1: Measure citation changes (illustrative example)
Before blocking CCBot:
- Citation Rate: 28%
After blocking CCBot:
- Citation Rate: 27% (-1pp)
- CCBot visits: 0
Conclusion: minimal impact in this example, so the block stays
Testing & Validation
robots.txt Tester Tools
1. Google Search Console
URL: https://search.google.com/search-console
Steps:
1. Add property (yoursite.com)
2. Go to: Settings → robots.txt report
   (the legacy robots.txt Tester under Legacy Tools has been retired)
3. Review fetch status, last crawl time, and any parse errors
Note: the report covers Google's own crawlers; for per-path,
per-agent testing use the validators and manual tests below.
Result:
✓ Fetched, parsed without errors
✗ Fetch failed or syntax errors flagged
2. Online robots.txt Validators
- https://www.robotstxt.org/validator/
- https://technicalseo.com/tools/robots-txt/
- https://en.ryte.com/free-tools/robots-txt-validator/
Upload or paste your robots.txt
Get syntax error detection
3. Manual Testing
# Test if crawler can access
curl -A "GPTBot/1.0" https://yoursite.com/blog
# Should return:
200 OK (if allowed)
403 Forbidden (if blocked by server)
# Note: robots.txt is advisory
# Server must enforce blocks separately
Common Validation Errors
Error 1: Wrong location
❌ https://yoursite.com/assets/robots.txt
❌ https://yoursite.com/admin/robots.txt
✓ https://yoursite.com/robots.txt
Error 2: Wrong file type
❌ robots.txt.txt
❌ Robots.TXT
❌ robots.html
✓ robots.txt
Error 3: Syntax errors
⚠ User-Agent: GPTBot (nonstandard casing on "agent")
⚠ disallow: /admin (lowercase directive)
Field names are case-insensitive per RFC 9309, so most parsers
accept these; still, use the canonical casing for readability:
✓ User-agent: GPTBot
✓ Disallow: /admin
❌ User-agent: GPTBot Allow: / (two directives on one line)
✓ User-agent: GPTBot
✓ Allow: /
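These syntax checks can be scripted; here is a minimal linter sketch (the lint helper and its field list are illustrative, not a complete validator):

```python
import re

# Fields accepted by most parsers (nonstandard extensions exist)
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}

def lint(text):
    """Return [(line_no, message)] for lines with basic syntax problems."""
    errors = []
    for no, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # comments are ignored
        if not line:
            continue
        field, sep, value = line.partition(":")
        if not sep:
            errors.append((no, "missing ':'"))
        elif field.strip().lower() not in KNOWN_FIELDS:
            errors.append((no, f"unknown field {field.strip()!r}"))
        elif re.search(r"\b(allow|disallow)\s*:", value, re.IGNORECASE):
            errors.append((no, "two directives on one line"))
    return errors

print(lint("User-agent: GPTBot\nAllow: /"))  # [] (clean file)
print(lint("Disalow: /admin\nUser-agent: GPTBot Allow: /"))
# flags line 1 (misspelled field) and line 2 (merged directives)
```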
Troubleshooting
Issue 1: Crawler Still Accessing Blocked Content
Possible causes:
1. robots.txt cached by crawler
→ Wait 24-48 hours for re-crawl
2. Syntax error in robots.txt
→ Validate with online tool
3. robots.txt not at root
→ Move to https://yoursite.com/robots.txt
4. Server not enforcing (robots.txt is advisory)
→ Implement server-level blocks
Server-level enforcement (Apache):
# .htaccess
RewriteEngine On
# Block GPTBot from /premium/
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteCond %{REQUEST_URI} ^/premium/ [NC]
RewriteRule .* - [F,L]
Issue 2: Search Engine Crawler Blocked Unintentionally
Symptom: Google can't index your site
Cause: Blocked Googlebot in robots.txt
Fix:
# Was:
User-agent: *
Disallow: /
# Should be:
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: * # Everyone else, including AI and unknown crawlers
Disallow: /
Issue 3: No Impact on Citations
Symptom: Blocked crawler but still getting cited
Explanation: Model already trained on your content
Solution:
robots.txt blocks future training, not current knowledge
To reduce existing citations:
1. Wait for next model training cycle (6-12 months)
2. Remove/update content
3. Request removal from AI company