Toolcroft

robots.txt Generator - Build Your Crawler Rules File

Generate a valid robots.txt file for your website. Control which bots can crawl which pages, add sitemap URLs, set crawl delays, and use presets for common configurations.

Your inputs are saved in this browser only. No data is ever sent to a server, and saved values won't be visible in other browsers or devices.

What is robots.txt?

robots.txt is a plain-text file stored at the root of your domain (e.g., https://example.com/robots.txt) that follows the Robots Exclusion Protocol (standardized as RFC 9309). Compliant crawlers fetch this file before crawling your site to learn which paths they are permitted to request.

Syntax reference

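  • User-agent: *: Starts a group of rules that applies to all crawlers. Use a specific token (e.g., Googlebot) to target one crawler.
  • Allow: /path/: Explicitly permits a path inside a section that is otherwise disallowed.
  • Disallow: /path/: Asks the crawler not to fetch the path or anything below it. An empty Disallow: value permits everything.
  • Crawl-delay: N: Requests a pause of N seconds between requests. It is a non-standard directive that some crawlers honour (Bing, for example) and Google ignores.
  • Sitemap: URL: Points crawlers to your XML sitemap for faster discovery. It applies site-wide, independent of any User-agent group.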
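
As a rough illustration of how these directives combine, the sketch below feeds a small rules file to Python's standard-library urllib.robotparser and asks which paths are allowed. The paths, the ExampleBot name, and the allowed() helper are hypothetical and chosen only for the example. Note that Python's parser applies the first matching rule in file order, which is why Allow is listed before Disallow here; Google instead picks the most specific (longest) matching rule, so other crawlers may evaluate the same file differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; the paths and bot name are made up.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5

User-agent: ExampleBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the rules above."""
    return parser.can_fetch(agent, "https://example.com" + path)


print(allowed("AnyCrawler", "/blog/"))                      # no rule matches
print(allowed("AnyCrawler", "/private/report.html"))        # Disallow: /private/
print(allowed("AnyCrawler", "/private/press/launch.html"))  # Allow listed first
print(allowed("ExampleBot", "/blog/"))                      # ExampleBot's own group
print(parser.crawl_delay("AnyCrawler"))                     # delay from the * group
```

Checking a generated file this way before deploying it can catch a rule that blocks more than intended, although it only reflects how this one parser reads the file.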

Common AI crawler user-agents

Add separate rule blocks for any you want to block:

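  • GPTBot: OpenAI
  • CCBot: Common Crawl (its public dataset is used to train many AI models)
  • anthropic-ai: Anthropic (Claude); Anthropic's current crawler identifies itself as ClaudeBot
  • Google-Extended: Google Gemini (formerly Bard) training; a control token rather than a separate crawler
  • PerplexityBot: Perplexity AI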
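
A minimal sketch of what "separate rule blocks" means in practice: the hypothetical block_rules() helper below emits one User-agent group with Disallow: / for each token in the list above. The function name and output layout are assumptions for illustration, not the generator's actual implementation.

```python
from typing import Iterable

# User-agent tokens taken from the list above.
AI_CRAWLERS = ["GPTBot", "CCBot", "anthropic-ai", "Google-Extended", "PerplexityBot"]


def block_rules(agents: Iterable[str]) -> str:
    """Build one 'Disallow: /' group per user-agent token.

    Groups are separated by a blank line, which is the conventional
    way to delimit rule blocks in robots.txt.
    """
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


print(block_rules(AI_CRAWLERS))
```

Remove any token from the list to leave that crawler unrestricted; crawlers without a matching group fall back to the User-agent: * rules.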

Deploying your robots.txt

Place the generated file at exactly /robots.txt on your web server, at the root of each host it should cover; a subdomain needs its own file. Astro and Next.js serve files placed in the public/ directory from the site root, and Gatsby does the same for files in its static/ directory.
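
Because crawlers look for the file only at the root of each host, a copy at a deeper path such as /blog/robots.txt is not consulted. The sketch below shows one way to work out where the file must live for any given page; robots_url() is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt location that governs `page_url`.

    The scheme, host, and port are kept; the path, query string,
    and fragment are replaced with /robots.txt.
    """
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?x=1"))
print(robots_url("https://shop.example.com:8443/cart"))
```

The second call illustrates that a subdomain (and a non-default port) is a separate host with its own robots.txt.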

Robots.txt limitations

The robots.txt protocol is voluntary: well-behaved crawlers (Googlebot, Bingbot) respect it, but malicious scrapers and vulnerability scanners ignore it entirely. To protect genuinely private content, use:

  • HTTP authentication: requires a username and password to access the page.
  • noindex meta tag: keeps a page out of search results, provided the page is not also disallowed in robots.txt, since a crawler must fetch the page to see the tag.
  • Server-side access controls: the only reliable way to block unwanted access.

robots.txt is best used to manage crawl budget by directing crawlers away from low-value pages, not to provide security.

Sitemap in robots.txt

You can list your sitemap URL(s) in robots.txt to help crawlers discover the pages you want indexed:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Multiple sitemaps can be listed, one Sitemap line each. All URLs must be absolute (including the protocol). Submitting sitemaps directly through Google Search Console remains the primary method for Google; the robots.txt declaration additionally lets crawlers from other engines discover them automatically.
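
The absolute-URL requirement is easy to check mechanically. The following sketch, with hypothetical helper names sitemap_urls() and is_absolute(), pulls the Sitemap lines out of a robots.txt string and flags any value that lacks a scheme or host. It is a simplified reader written for this example, not a full robots.txt parser.

```python
from typing import List
from urllib.parse import urlsplit


def sitemap_urls(robots_txt: str) -> List[str]:
    """Collect the values of all Sitemap lines (field name is case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def is_absolute(url: str) -> bool:
    """True if the URL has an http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


sample = (
    "User-agent: *\n"
    "Disallow:\n"
    "Sitemap: https://example.com/sitemap.xml\n"
    "Sitemap: /sitemap-news.xml\n"
)

for url in sitemap_urls(sample):
    status = "ok" if is_absolute(url) else "not absolute"
    print(f"{url}: {status}")
```

In the sample, the second entry is reported as not absolute because it omits the protocol and host.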