Robots.txt is a text file at your website’s root that instructs search engine crawlers which URLs they can access. It uses the Robots Exclusion Protocol with directives like User-agent, Allow, and Disallow. Note that robots.txt controls crawling, not indexing - blocked pages can still appear in search results.
What is Robots.txt?
Robots.txt is a plain text file that tells web crawlers which pages they can and cannot access. It follows the Robots Exclusion Protocol standard.
Key points:
- Located at domain.com/robots.txt
- Controls crawler access (not indexing)
- Followed by well-behaved bots
- Can reference sitemap location
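As an illustration of the last two points, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching a page, using Python's standard-library urllib.robotparser (the example.com URLs and the MyCrawler user-agent are placeholders; this parser implements the original standard and ignores Google-style wildcards):

from urllib.robotparser import RobotFileParser

# Load the site's robots.txt (placeholder domain)
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# A polite crawler asks for permission before requesting a URL
url = "https://example.com/admin/settings"
if parser.can_fetch("MyCrawler", url):
    print("Allowed to fetch:", url)
else:
    print("Blocked by robots.txt:", url)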
Robots.txt Syntax
Basic Structure
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://example.com/sitemap.xml
Directives
| Directive | Purpose |
|---|---|
| User-agent | Specifies which crawler the rules apply to |
| Disallow | Blocks access to specified paths |
| Allow | Permits access (overrides Disallow) |
| Sitemap | Points to XML sitemap location |
| Crawl-delay | Requests a delay between requests (ignored by Google) |
User-agent Examples
# All crawlers
User-agent: *
# Google only
User-agent: Googlebot
# Google Images
User-agent: Googlebot-Image
# Bing
User-agent: Bingbot
Common Robots.txt Configurations
Allow Everything
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
Block Everything
User-agent: *
Disallow: /
Warning: This blocks your entire site from being crawled.
Standard Configuration
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Disallow: /search
Disallow: /*?*
Allow: /
Sitemap: https://example.com/sitemap.xml
Block Specific Bots
# Allow all bots
User-agent: *
Disallow:
# Block bad bots
User-agent: BadBot
Disallow: /
User-agent: AnotherBadBot
Disallow: /
Pattern Matching
Wildcards
# Block all PDF files
User-agent: *
Disallow: /*.pdf$
# Block all URLs with parameters
Disallow: /*?
# Block specific parameter
Disallow: /*?sort=
Path Matching
| Pattern | Matches |
|---|---|
| /admin | /admin, /admin/, /admin/page, and also /administrator (prefix match) |
| /admin/ | /admin/, /admin/page (not /admin) |
| /*.pdf$ | Any URL ending in .pdf |
| /page/* | /page/anything |
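To make the table above concrete, here is a rough sketch (an illustration only, not Google's implementation) that turns a robots.txt path pattern into a regular expression the way the wildcard rules are documented: paths are prefix matches, * matches any sequence of characters, and a trailing $ anchors the end of the URL.

import re

def pattern_to_regex(pattern):
    # Escape regex metacharacters, then restore the robots.txt wildcards:
    # "*" becomes ".*" and an escaped "$" becomes an end-of-URL anchor.
    escaped = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile("^" + escaped)

# Quick checks against the table above
print(bool(pattern_to_regex("/admin").match("/administrator")))          # True (prefix match)
print(bool(pattern_to_regex("/*.pdf$").match("/files/report.pdf")))      # True
print(bool(pattern_to_regex("/*.pdf$").match("/files/report.pdf?x=1")))  # False

Real crawlers also apply precedence rules (the most specific matching rule wins), so treat this only as a way to reason about individual patterns.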
What to Block
Commonly Blocked
User-agent: *
# Admin areas
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /login/
# Internal search
Disallow: /search
Disallow: /*?s=
# Duplicate content
Disallow: /print/
Disallow: /*?print=
# Staging/test
Disallow: /staging/
Disallow: /test/
# API endpoints
Disallow: /api/
# User-generated pages
Disallow: /my-account/
What NOT to Block
Never block:
- CSS files (needed for rendering)
- JavaScript files (needed for rendering)
- Images you want indexed
- Pages you want to rank
- Canonical URLs
Robots.txt vs Noindex
| Method | Effect |
|---|---|
| robots.txt Disallow | Prevents crawling, not indexing |
| noindex meta tag | Prevents indexing (must be crawled to work) |
| X-Robots-Tag header | Prevents indexing (server-level) |
Important: If you block a page via robots.txt AND add noindex, the noindex won’t be seen because the page isn’t crawled.
Correct Approach
To hide from search results:
- Allow crawling in robots.txt
- Add noindex meta tag to page
- Google crawls, sees noindex, doesn’t index
Wrong approach:
- Block in robots.txt
- Add noindex
- Google can’t see noindex
- Page might still appear in results
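As a rough way to audit this, the sketch below (standard-library Python, placeholder URL, crude substring checks rather than real HTML parsing) fetches a page and looks for the two noindex signals discussed above: the robots meta tag in the HTML and the X-Robots-Tag response header.

import urllib.request

def noindex_signals(url):
    # Fetch the page and inspect both de-indexing signals (sketch only)
    with urllib.request.urlopen(url) as resp:
        header = resp.headers.get("X-Robots-Tag") or ""
        body = resp.read().decode("utf-8", errors="replace").lower()
    return {
        # HTTP header form, e.g.  X-Robots-Tag: noindex
        "x_robots_tag": "noindex" in header.lower(),
        # Meta tag form, e.g.  <meta name="robots" content="noindex">
        "meta_robots": 'name="robots"' in body and "noindex" in body,
    }

print(noindex_signals("https://example.com/private-page"))  # placeholder URL

Remember that both signals only take effect if the page itself is not blocked in robots.txt.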
Testing Robots.txt
Google Search Console
- Open Search Console
- Run the URL Inspection tool on a specific URL
- The result shows whether Googlebot is blocked from crawling it by robots.txt
robots.txt Report
- Search Console > Settings > robots.txt report
- Shows the robots.txt files Google has fetched, plus fetch errors and warnings
- Replaces the retired robots.txt Tester
Online Validators
- Google’s open-source robots.txt parser (github.com/google/robotstxt)
- Bing Webmaster Tools
- Third-party validators
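Before deploying, you can also sanity-check simple rules locally. The sketch below uses Python's urllib.robotparser to parse a draft robots.txt from a list of lines; note that this parser follows the original standard, so test wildcard (* and $) rules and Allow/Disallow precedence with Google's tools instead. The MyCrawler user-agent and example.com domain are placeholders.

from urllib.robotparser import RobotFileParser

# Draft rules to test (simple prefix rules only)
draft = """
User-agent: *
Disallow: /admin/
Disallow: /search
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(draft)

# "MyCrawler" is covered by the * group
for path in ("/admin/settings", "/search?q=test", "/blog/post", "/assets/site.css"):
    allowed = parser.can_fetch("MyCrawler", "https://example.com" + path)
    print(path, "->", "allowed" if allowed else "blocked")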
Common Mistakes
1. Blocking CSS/JS
# Wrong - breaks rendering
User-agent: *
Disallow: /wp-includes/
Disallow: /wp-content/
# Right - allow static assets
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
2. Blocking Important Pages
# Too broad: blocks every URL starting with /blog
User-agent: *
Disallow: /blog
# Should be specific
User-agent: *
Disallow: /blog/draft/
3. Case Sensitivity Issues
# Paths are case-sensitive, so these rules block different directories
Disallow: /Admin/
Disallow: /admin/
4. Trailing Slash Differences
# Prefix match: blocks /admin, /admin/page, and also /administrator
Disallow: /admin
# Blocks only URLs under /admin/ (not /admin itself)
Disallow: /admin/
Robots.txt for Different Platforms
WordPress
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /trackback/
Disallow: /feed/
Disallow: /*?replytocom=
Sitemap: https://example.com/sitemap_index.xml
Astro/Static Sites
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
E-commerce
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /wishlist/
Disallow: /*?add-to-cart=
Disallow: /*?orderby=
Disallow: /*?filter
Sitemap: https://example.com/sitemap.xml
Robots.txt Checklist
Setup
- File at domain.com/robots.txt
- Valid syntax (test with tools)
- Sitemap referenced
- Not blocking important content
Security Considerations
- Admin areas blocked
- API endpoints blocked (if needed)
- Sensitive paths blocked
- Static assets NOT blocked
Verification
- Tested in Search Console
- CSS/JS accessible to crawlers
- Important pages crawlable
- No accidental blocks
Conclusion
Robots.txt controls crawler access but doesn’t prevent indexing. Use it to block admin areas, internal search, and non-public sections while ensuring important content remains crawlable.
Always test changes with Google’s tools before deploying. Remember that blocking crawling is different from blocking indexing - use noindex for pages you want hidden from search results.
Combine proper robots.txt configuration with XML sitemaps and technical SEO best practices for optimal crawling and indexing.