What is robots.txt?

robots.txt is a plain-text file served from the root of your domain (e.g., https://example.com/robots.txt) that follows the Robots Exclusion Protocol. Compliant web crawlers fetch this file before crawling your site to learn which paths they are permitted to request.
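Because the file always sits at the root, its location can be derived from any page URL on the same host. A minimal Python sketch, assuming the standard-library urljoin; the helper name robots_url is hypothetical and chosen only for illustration:

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    # A leading slash makes urljoin replace the whole path (and drop the
    # query string), keeping only the scheme and host of the page.
    return urljoin(page_url, "/robots.txt")


print(robots_url("https://example.com/blog/post?id=7"))
print(robots_url("https://shop.example.com/cart"))
```

Note that the second call resolves to a different file: each host (including each subdomain) is covered by its own robots.txt.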

Syntax reference

User-agent: *: Applies to every crawler that is not matched by a more specific rule block. Use a specific name (e.g., Googlebot) to target one.
Allow: /path/: Explicitly permits a path that would otherwise be disallowed.
Disallow: /path/: Asks the crawler not to access the path or anything below it.
Crawl-delay: N: Requests a pause of N seconds between requests (a non-standard directive; not honoured by Google).
Sitemap: URL: Points crawlers to your XML sitemap for faster discovery.
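
The directives above can be tried out locally. The following is a rough sketch using Python's standard-library urllib.robotparser; the paths and the bot name ExampleBot are made up for illustration. One caveat: this parser applies the first rule that matches, whereas Google applies the most specific (longest) matching rule. Listing the narrower Allow line before the broader Disallow line gives the same answer under both approaches.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; the paths and the bot name below are made up.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Check a path on example.com against the rules for ExampleBot."""
    return parser.can_fetch("ExampleBot", "https://example.com" + path)


print(allowed("/private/press/kit.html"))  # matched by the Allow line
print(allowed("/private/notes.html"))      # matched by the Disallow line
print(allowed("/blog/"))                   # no rule matches, so crawling is permitted
print(parser.crawl_delay("ExampleBot"))    # the requested delay in seconds
```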

Common AI crawler user-agents

Add a separate rule block for each crawler you want to block:

GPTBot: OpenAI
CCBot: Common Crawl (used by many AI models)
anthropic-ai: Anthropic Claude
Google-Extended: Google Gemini (formerly Bard) training
PerplexityBot: Perplexity AI
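
As a sketch of what those rule blocks might look like, the snippet below generates one "Disallow: /" block per user-agent from the list above and sanity-checks the result with Python's urllib.robotparser. The helper name build_block_rules is hypothetical, and blocking the whole site ("/") is an assumption; narrow the path if you only want to protect part of it.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the list above.
AI_CRAWLERS = ["GPTBot", "CCBot", "anthropic-ai", "Google-Extended", "PerplexityBot"]


def build_block_rules(agents: list[str]) -> str:
    """Return one 'Disallow: /' rule block per user-agent, separated by blank lines."""
    blocks = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    return "\n".join(blocks)


rules = build_block_rules(AI_CRAWLERS)
print(rules)

# Sanity check: a listed crawler is blocked, an unlisted one is unaffected.
parser = RobotFileParser()
parser.parse(rules.splitlines())
print(parser.can_fetch("GPTBot", "https://example.com/article"))
print(parser.can_fetch("Googlebot", "https://example.com/article"))
```

Crawler names change over time, so it is worth checking each vendor's documentation for its current user-agent tokens before relying on a fixed list.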

Deploying your robots.txt

Place the generated file at exactly /robots.txt on your web server. Most static site generators copy a designated static-assets directory to the site root unchanged: Astro and Next.js use public/, while Gatsby uses static/. A robots.txt placed in that directory is served automatically.
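
If you generate the file from a build script, the write step can be as small as the sketch below. It assumes a static-assets directory named public (adjust for your generator); the helper name write_robots is hypothetical. The demo writes into a temporary directory so that it has no side effects on a real project.

```python
import tempfile
from pathlib import Path


def write_robots(static_dir: Path, content: str) -> Path:
    """Write content to <static_dir>/robots.txt and return the file path."""
    static_dir.mkdir(parents=True, exist_ok=True)
    target = static_dir / "robots.txt"
    target.write_text(content, encoding="utf-8")
    return target


# Demo in a temporary directory; in a real project, pass your generator's
# static-assets directory instead.
with tempfile.TemporaryDirectory() as tmp:
    # An empty Disallow value permits crawling of the whole site.
    path = write_robots(Path(tmp) / "public", "User-agent: *\nDisallow:\n")
    print(path.name)
    print(path.read_text(encoding="utf-8"))
```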

Robots.txt limitations

The robots.txt protocol is voluntary: well-behaved crawlers (Googlebot, Bingbot) respect it, but malicious scrapers and vulnerability scanners ignore it entirely. To keep content private or out of search results, use one of the following instead:

HTTP authentication: requires a username and password to access the page.
noindex meta tag: keeps a page out of search results. The page must remain crawlable, because a crawler that is blocked by robots.txt never sees the tag.
Server-side access controls: enforced by the server on every request, which makes them the only reliable way to block unwanted access.

robots.txt is best used to manage crawl budget (directing crawlers away from low-value pages), not for security.
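
To make the voluntary nature concrete: the check runs inside the crawler's own code, and nothing on the server enforces it. A small sketch using Python's urllib.robotparser; the function name will_request, the respect_robots flag, and both bot names are hypothetical and exist only for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = ["User-agent: *", "Disallow: /admin/"]


def will_request(agent: str, url: str, respect_robots: bool = True) -> bool:
    """Return True if this (hypothetical) crawler would go ahead and request url."""
    if not respect_robots:
        # A scraper that skips the check is never stopped by robots.txt itself.
        return True
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch(agent, url)


url = "https://example.com/admin/users"
print(will_request("PoliteBot", url))                          # honours the Disallow rule
print(will_request("RudeScraper", url, respect_robots=False))  # ignores the file entirely
```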

Sitemap in robots.txt

You can list your sitemap URL(s) in robots.txt to help crawlers discover the pages you want indexed:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Multiple sitemaps can be listed, one per Sitemap line. All URLs must be absolute (including the protocol). Submitting sitemaps directly through Google Search Console remains the primary method for Google, but the robots.txt declaration lets crawlers from all engines discover them automatically.
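
A quick way to check the declarations is to read them back with Python's urllib.robotparser and flag any that are not absolute. This is a sketch under stated assumptions: the helper name absolute_sitemaps and the relative entry in the sample input are made up for illustration, and site_maps() requires Python 3.8 or later.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def absolute_sitemaps(robots_lines: list[str]) -> list[str]:
    """Return the declared Sitemap URLs that include both a scheme and a host."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    # site_maps() returns None when no Sitemap line is present.
    declared = parser.site_maps() or []
    return [u for u in declared if urlsplit(u).scheme and urlsplit(u).netloc]


lines = [
    "Sitemap: https://example.com/sitemap.xml",
    "Sitemap: https://example.com/sitemap-news.xml",
    "Sitemap: /sitemap-relative.xml",  # relative, so it is filtered out
]
print(absolute_sitemaps(lines))
```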