Robots.txt is a text file at your website’s root that instructs search engine crawlers which URLs they can access. It uses the Robots Exclusion Protocol with directives like User-agent, Allow, and Disallow. Note that robots.txt controls crawling, not indexing: blocked pages can still appear in search results.

What is Robots.txt?

Robots.txt is a plain text file that tells web crawlers which pages they can and cannot access. It follows the Robots Exclusion Protocol standard.

Key points:

  • Located at domain.com/robots.txt
  • Controls crawler access (not indexing)
  • Honored by well-behaved bots (it is advisory, not an access control)
  • Can reference sitemap location
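
If you want to check this programmatically, Python's standard-library urllib.robotparser can fetch and query a robots.txt file. This is a minimal sketch; example.com and the paths are placeholders for your own site:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder domain
parser.read()  # downloads and parses the file

# Can a generic crawler fetch these URLs? (Results depend on the file's rules.)
print(parser.can_fetch("*", "https://example.com/"))
print(parser.can_fetch("*", "https://example.com/admin/"))

# Sitemap lines, if any (Python 3.8+); returns None when the file lists none
print(parser.site_maps())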

Robots.txt Syntax

Basic Structure

User-agent: *
Disallow: /admin/
Allow: /admin/public/

Sitemap: https://example.com/sitemap.xml

Directives

Directive      Purpose
User-agent     Specifies which crawler the rules apply to
Disallow       Blocks access to specified paths
Allow          Permits access (overrides Disallow)
Sitemap        Points to the XML sitemap location
Crawl-delay    Requests a delay between requests (not supported by Google)

User-agent Examples

# All crawlers
User-agent: *

# Google only
User-agent: Googlebot

# Google Images
User-agent: Googlebot-Image

# Bing
User-agent: Bingbot

Common Robots.txt Configurations

Allow Everything

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

Block Everything

User-agent: *
Disallow: /

Warning: This blocks your entire site from being crawled.

Standard Configuration

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Disallow: /search
Disallow: /*?*
Allow: /

Sitemap: https://example.com/sitemap.xml

Block Specific Bots

# Allow all bots
User-agent: *
Disallow:

# Block bad bots
User-agent: BadBot
Disallow: /

User-agent: AnotherBadBot
Disallow: /
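
To see how per-bot groups behave, here is a small sketch using Python's urllib.robotparser; BadBot and the URLs are placeholders carried over from the example above:

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow:

User-agent: BadBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse() accepts the file's lines directly

print(parser.can_fetch("BadBot", "https://example.com/page"))       # False: its group blocks everything
print(parser.can_fetch("SomeOtherBot", "https://example.com/page")) # True: falls back to the * group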

Pattern Matching

Wildcards

# Block all PDF files
User-agent: *
Disallow: /*.pdf$

# Block all URLs with parameters
Disallow: /*?

# Block specific parameter
Disallow: /*?sort=

Path Matching

Pattern      Matches
/admin       /admin, /admin/, /admin/page
/admin/      /admin/, /admin/page (not /admin)
/*.pdf$      Any URL ending in .pdf
/page/*      /page/anything
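
The wildcard rules above can be approximated with a short helper that rewrites a pattern into a regular expression, where * matches any run of characters and a trailing $ anchors the end of the URL. This is an illustrative sketch rather than a complete Robots Exclusion Protocol matcher; note that some parsers (including Python's built-in urllib.robotparser) use plain prefix matching and do not interpret * or $ at all.

import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern applies to a URL path."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"           # * matches any sequence of characters
        elif ch == "$":
            regex += "$"            # $ anchors the end of the URL
        else:
            regex += re.escape(ch)  # everything else is literal
    return re.match(regex, path) is not None  # rules always match from the start of the path

# The rows from the table above
print(rule_matches("/admin", "/admin/page"))             # True (prefix match)
print(rule_matches("/admin/", "/admin"))                 # False (trailing slash required)
print(rule_matches("/*.pdf$", "/files/report.pdf"))      # True
print(rule_matches("/*.pdf$", "/files/report.pdf?v=2"))  # False ($ anchors the end)
print(rule_matches("/page/*", "/page/anything"))         # True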

What to Block

Commonly Blocked

User-agent: *
# Admin areas
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /login/

# Internal search
Disallow: /search
Disallow: /*?s=

# Duplicate content
Disallow: /print/
Disallow: /*?print=

# Staging/test
Disallow: /staging/
Disallow: /test/

# API endpoints
Disallow: /api/

# User-generated pages
Disallow: /my-account/

What NOT to Block

Never block:

  • CSS files (needed for rendering)
  • JavaScript files (needed for rendering)
  • Images you want indexed
  • Pages you want to rank
  • Canonical URLs

Robots.txt vs Noindex

Method                 Effect
robots.txt Disallow    Prevents crawling, not indexing
noindex meta tag       Prevents indexing (must be crawled to work)
X-Robots-Tag header    Prevents indexing (server-level)

Important: If you block a page via robots.txt AND add noindex, the noindex won’t be seen because the page isn’t crawled.
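
For reference, the two noindex mechanisms in the table look like this; how you set the response header depends on your server or framework:

<!-- In the page's <head> -->
<meta name="robots" content="noindex">

# Or as a raw HTTP response header (works for PDFs and other non-HTML files)
X-Robots-Tag: noindex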

Correct Approach

To hide from search results:

  1. Allow crawling in robots.txt
  2. Add noindex meta tag to page
  3. Google crawls, sees noindex, doesn’t index

Wrong approach:

  1. Block in robots.txt
  2. Add noindex
  3. Google can’t see noindex
  4. Page might still appear in results

Testing Robots.txt

Google Search Console

  1. Go to Search Console
  2. URL Inspection tool
  3. Check whether the URL is reported as allowed or blocked by robots.txt

robots.txt Report

  1. Search Console > Settings
  2. Open the robots.txt report
  3. Confirm the file was fetched successfully and parsed without errors

Online Validators

  • Google Search Console’s robots.txt report
  • Bing Webmaster Tools
  • Third-party validators
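
You can also script a basic check yourself. This sketch uses Python's standard library to confirm the file is reachable at the root and served as plain text; example.com is a placeholder:

from urllib.request import urlopen

# Placeholder URL - substitute your own domain
with urlopen("https://example.com/robots.txt") as response:
    print(response.status)                      # expect 200
    print(response.headers.get_content_type())  # expect text/plain
    body = response.read().decode("utf-8", errors="replace")

print(body[:200])  # eyeball the first few rules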

Common Mistakes

1. Blocking CSS/JS

# Wrong - breaks rendering
User-agent: *
Disallow: /wp-includes/
Disallow: /wp-content/

# Right - allow static assets
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

2. Blocking Important Pages

# Accidentally blocking blog
User-agent: *
Disallow: /blog

# Should be specific
User-agent: *
Disallow: /blog/draft/

3. Case Sensitivity Issues

# These are different
Disallow: /Admin/
Disallow: /admin/

4. Missing Trailing Slash Issues

# Blocks /admin, /admin/, and also /administrator (prefix match)
Disallow: /admin

# Blocks only /admin/ paths
Disallow: /admin/
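
Because matching is prefix-based, the trailing slash decides whether sibling paths such as /administrator are caught. A tiny sketch makes the difference concrete:

paths = ["/admin", "/admin/", "/admin/users", "/administrator"]

for rule in ("/admin", "/admin/"):
    blocked = [p for p in paths if p.startswith(rule)]
    print(rule, "blocks:", blocked)

# /admin blocks: ['/admin', '/admin/', '/admin/users', '/administrator']
# /admin/ blocks: ['/admin/', '/admin/users']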

Robots.txt for Different Platforms

WordPress

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /trackback/
Disallow: /feed/
Disallow: /*?replytocom=

Sitemap: https://example.com/sitemap_index.xml

Astro/Static Sites

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

E-commerce

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /wishlist/
Disallow: /*?add-to-cart=
Disallow: /*?orderby=
Disallow: /*?filter

Sitemap: https://example.com/sitemap.xml

Robots.txt Checklist

Setup

  • File at domain.com/robots.txt
  • Valid syntax (test with tools)
  • Sitemap referenced
  • Not blocking important content

Security Considerations

  • Admin areas blocked
  • API endpoints blocked (if needed)
  • Sensitive paths blocked
  • Static assets NOT blocked

Verification

  • Tested in Search Console
  • CSS/JS accessible to crawlers
  • Important pages crawlable
  • No accidental blocks
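
The last two checks can be automated. This sketch uses Python's urllib.robotparser; the domain and URL list are placeholders for your own:

from urllib.robotparser import RobotFileParser

# URLs that must stay crawlable - placeholders for your own list
MUST_CRAWL = [
    "https://example.com/",
    "https://example.com/blog/",
    "https://example.com/products/widget",
]

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

for url in MUST_CRAWL:
    if not parser.can_fetch("Googlebot", url):
        print("Blocked by robots.txt:", url)

# Caveat: urllib.robotparser uses simple prefix matching, so wildcard rules
# may not be evaluated exactly the way Google evaluates them.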

Conclusion

Robots.txt controls crawler access but doesn’t prevent indexing. Use it to block admin areas, internal search, and non-public sections while ensuring important content remains crawlable.

Always test changes with Google’s tools before deploying. Remember that blocking crawling is different from blocking indexing; use noindex for pages you want hidden from search results.

Combine proper robots.txt configuration with XML sitemaps and technical SEO best practices for optimal crawling and indexing.

Frequently Asked Questions

Does robots.txt prevent pages from being indexed?

No, robots.txt only controls crawling, not indexing. If other sites link to a blocked page, it can still appear in search results (with no description). To prevent indexing, use noindex meta tags or X-Robots-Tag headers. The blocked page must be crawlable for Google to see the noindex directive.

Where should robots.txt be located?

Robots.txt must be at your root domain: https://example.com/robots.txt. It's case-sensitive and must be named exactly 'robots.txt' in lowercase. Subdomain robots.txt files only apply to that subdomain.

What happens if I don't have a robots.txt file?

Without a robots.txt file, search engines assume they can crawl all accessible pages. This is fine for most sites. Having no robots.txt is better than having a misconfigured one that accidentally blocks important content.