Robots.txt is a text file at your website’s root that instructs search engine crawlers which URLs they can access. It uses the Robots Exclusion Protocol with directives like User-agent, Allow, and Disallow. Note that robots.txt controls crawling, not indexing - blocked pages can still appear in search results.
What is Robots.txt?
Robots.txt is a plain text file that tells web crawlers which pages they can and cannot access. It follows the Robots Exclusion Protocol standard.
Key points:
- Located at domain.com/robots.txt
- Controls crawler access (not indexing)
- Followed by well-behaved bots
- Can reference sitemap location
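As an illustration of the last two points, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching a page, using Python's standard-library urllib.robotparser (the example.com URLs and the MyCrawler user-agent are placeholders; this parser implements the original standard and ignores Google-style wildcards):

from urllib.robotparser import RobotFileParser

# Load the site's robots.txt (placeholder domain)
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# A polite crawler asks for permission before requesting a URL
url = "https://example.com/admin/settings"
if parser.can_fetch("MyCrawler", url):
    print("Allowed to fetch:", url)
else:
    print("Blocked by robots.txt:", url)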
Robots.txt Syntax
Basic Structure
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
Directives
| Directive | Purpose |
|---|---|
| User-agent | Specifies which crawler the rules apply to |
| Disallow | Blocks access to specified paths |
| Allow | Permits access (overrides Disallow) |
| Sitemap | Points to XML sitemap location |
| Crawl-delay | Requests a delay between requests (ignored by Google) |
User-agent Examples
# All crawlers
User-agent: *
# Google only
User-agent: Googlebot
# Google Images
User-agent: Googlebot-Image
# Bing
User-agent: Bingbot
Common Robots.txt Configurations
Allow Everything
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Block Everything
User-agent: *
Disallow: /
Warning: This blocks your entire site from being crawled.
Standard Configuration
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Disallow: /search
Disallow: /*?*
Allow: /
Sitemap: https://example.com/sitemap.xml
Block Specific Bots
# Allow all bots
User-agent: *
Disallow:
# Block bad bots
User-agent: BadBot
Disallow: /
User-agent: AnotherBadBot
Disallow: /
Pattern Matching
Wildcards
# Block all PDF files
User-agent: *
Disallow: /*.pdf$
# Block all URLs with parameters
Disallow: /*?
# Block specific parameter
Disallow: /*?sort=
Path Matching
| Pattern | Matches |
|---|---|
| /admin | /admin, /admin/, /admin/page, and also /administrator (prefix match) |
| /admin/ | /admin/, /admin/page (not /admin) |
| /*.pdf$ | Any URL ending in .pdf |
| /page/* | /page/anything |
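To make the table above concrete, here is a rough sketch (an illustration only, not Google's implementation) that turns a robots.txt path pattern into a regular expression the way the wildcard rules are documented: paths are prefix matches, * matches any sequence of characters, and a trailing $ anchors the end of the URL.

import re

def pattern_to_regex(pattern):
    # Escape regex metacharacters, then restore the robots.txt wildcards:
    # "*" becomes ".*" and an escaped "$" becomes an end-of-URL anchor.
    escaped = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile("^" + escaped)

# Quick checks against the table above
print(bool(pattern_to_regex("/admin").match("/administrator")))          # True (prefix match)
print(bool(pattern_to_regex("/*.pdf$").match("/files/report.pdf")))      # True
print(bool(pattern_to_regex("/*.pdf$").match("/files/report.pdf?x=1")))  # False

Real crawlers also apply precedence rules (the most specific matching rule wins), so treat this only as a way to reason about individual patterns.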
What to Block
Commonly Blocked
User-agent: *
# Admin areas
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /login/
# Internal search
Disallow: /search
Disallow: /*?s=
# Duplicate content
Disallow: /print/
Disallow: /*?print=
# Staging/test
Disallow: /staging/
Disallow: /test/
# API endpoints
Disallow: /api/
# User-generated pages
Disallow: /my-account/
What NOT to Block
Never block:
- CSS files (needed for rendering)
- JavaScript files (needed for rendering)
- Images you want indexed
- Pages you want to rank
- Canonical URLs
Robots.txt vs Noindex
| Method | Effect |
|---|---|
| robots.txt Disallow | Prevents crawling, not indexing |
| noindex meta tag | Prevents indexing (must be crawled to work) |
| X-Robots-Tag header | Prevents indexing (server-level) |
Important: If you block a page via robots.txt AND add noindex, the noindex won’t be seen because the page isn’t crawled.
Correct Approach
To hide from search results:
- Allow crawling in robots.txt
- Add noindex meta tag to page
- Google crawls, sees noindex, doesn’t index
Wrong approach:
- Block in robots.txt
- Add noindex
- Google can’t see noindex
- Page might still appear in results
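As a rough way to audit this, the sketch below (standard-library Python, placeholder URL, crude substring checks rather than real HTML parsing) fetches a page and looks for the two noindex signals discussed above: the robots meta tag in the HTML and the X-Robots-Tag response header.

import urllib.request

def noindex_signals(url):
    # Fetch the page and inspect both de-indexing signals (sketch only)
    with urllib.request.urlopen(url) as resp:
        header = resp.headers.get("X-Robots-Tag") or ""
        body = resp.read().decode("utf-8", errors="replace").lower()
    return {
        # HTTP header form, e.g.  X-Robots-Tag: noindex
        "x_robots_tag": "noindex" in header.lower(),
        # Meta tag form, e.g.  <meta name="robots" content="noindex">
        "meta_robots": 'name="robots"' in body and "noindex" in body,
    }

print(noindex_signals("https://example.com/private-page"))  # placeholder URL

Remember that both signals only take effect if the page itself is not blocked in robots.txt.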
Testing Robots.txt
Google Search Console
- Open Search Console
- Run the URL Inspection tool on a specific URL
- The result shows whether Googlebot is blocked from crawling it by robots.txt
robots.txt Report
- Search Console > Settings > robots.txt report
- Shows the robots.txt files Google has fetched, plus fetch errors and warnings
- Replaces the retired robots.txt Tester
Online Validators
- Google’s open-source robots.txt parser (github.com/google/robotstxt)
- Bing Webmaster Tools
- Third-party validators
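Before deploying, you can also sanity-check simple rules locally. The sketch below uses Python's urllib.robotparser to parse a draft robots.txt from a list of lines; note that this parser follows the original standard, so test wildcard (* and $) rules and Allow/Disallow precedence with Google's tools instead. The MyCrawler user-agent and example.com domain are placeholders.

from urllib.robotparser import RobotFileParser

# Draft rules to test (simple prefix rules only)
draft = """
User-agent: *
Disallow: /admin/
Disallow: /search
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(draft)

# "MyCrawler" is covered by the * group
for path in ("/admin/settings", "/search?q=test", "/blog/post", "/assets/site.css"):
    allowed = parser.can_fetch("MyCrawler", "https://example.com" + path)
    print(path, "->", "allowed" if allowed else "blocked")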
Common Mistakes
1. Blocking CSS/JS
# Wrong - breaks rendering
User-agent: *
Disallow: /wp-includes/
Disallow: /wp-content/
# Right - allow static assets
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
2. Blocking Important Pages
# Too broad: blocks every URL starting with /blog
User-agent: *
Disallow: /blog
# Should be specific
User-agent: *
Disallow: /blog/draft/
3. Case Sensitivity Issues
# Paths are case-sensitive, so these rules block different directories
Disallow: /Admin/
Disallow: /admin/
4. Trailing Slash Differences
# Prefix match: blocks /admin, /admin/page, and also /administrator
Disallow: /admin
# Blocks only URLs under /admin/ (not /admin itself)
Disallow: /admin/
Robots.txt for Different Platforms
WordPress
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /trackback/
Disallow: /feed/
Disallow: /*?replytocom=
Sitemap: https://example.com/sitemap_index.xml
Astro/Static Sites
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
E-commerce
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /wishlist/
Disallow: /*?add-to-cart=
Disallow: /*?orderby=
Disallow: /*?filter
Sitemap: https://example.com/sitemap.xml
Robots.txt Checklist
Setup
- File at domain.com/robots.txt
- Valid syntax (test with tools)
- Sitemap referenced
- Not blocking important content
Security Considerations
- Admin areas blocked
- API endpoints blocked (if needed)
- Sensitive paths blocked
- Static assets NOT blocked
Verification
- Tested in Search Console
- CSS/JS accessible to crawlers
- Important pages crawlable
- No accidental blocks
Conclusion
Robots.txt controls crawler access but doesn’t prevent indexing. Use it to block admin areas, internal search, and non-public sections while ensuring important content remains crawlable.
Always test changes with Google’s tools before deploying. Remember that blocking crawling is different from blocking indexing - use noindex for pages you want hidden from search results.
Combine proper robots.txt configuration with XML sitemaps and technical SEO best practices for optimal crawling and indexing.