
Robots.txt Generator

Generate robots.txt files with user-agent rules, allow/disallow paths and sitemap.



How to Use the Robots.txt Generator

Define per-crawler rules, list the paths you want to allow or block, attach your sitemap, and copy the generated file straight into the root of your website. The entire generator runs in your browser, so no URL patterns or site structure are transmitted anywhere.

  1. Choose a user-agent — Start with * to apply rules to every crawler, or specify a named bot such as Googlebot, Bingbot, DuckDuckBot or GPTBot when you need per-bot behavior.
  2. Add Disallow paths — List the URL prefixes you want to keep out of crawlers, one per line. A path of /admin/ blocks everything under that folder; a bare / blocks the entire site.
  3. Add Allow exceptions — Use Allow entries to carve out specific paths that should remain crawlable even though a broader Disallow applies. Allow rules take precedence when they are more specific than the matching Disallow.
  4. Attach your sitemap — Paste the absolute URL of your sitemap.xml. The Sitemap directive is global, so a single entry applies to all user-agents.
  5. Copy or download — Copy the generated text or download the file, then upload it to the root of your domain so it is reachable at https://yourdomain.com/robots.txt.
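The steps above can be sketched in a few lines of Python. This is only an illustration of the output format, not the generator's actual code; the function name build_robots_txt and the "agents", "allow" and "disallow" keys are hypothetical.

```python
def build_robots_txt(groups, sitemaps=()):
    """Render robots.txt text from rule groups (hypothetical input shape).

    Each group is a dict with:
      "agents"   -> list of user-agent tokens, e.g. ["*"]
      "allow"    -> optional list of paths to re-open
      "disallow" -> optional list of path prefixes to block
    """
    lines = []
    for group in groups:
        for agent in group["agents"]:
            lines.append(f"User-agent: {agent}")
        allow = group.get("allow", [])
        disallow = group.get("disallow", [])
        for path in allow:
            lines.append(f"Allow: {path}")
        for path in disallow:
            lines.append(f"Disallow: {path}")
        if not allow and not disallow:
            # An empty Disallow value means nothing is blocked.
            lines.append("Disallow:")
        lines.append("")  # blank line between groups
    # Sitemap lines are global, so they are written once at the end.
    for url in sitemaps:
        lines.append(f"Sitemap: {url}")
    return "\n".join(lines).rstrip("\n") + "\n"


if __name__ == "__main__":
    print(build_robots_txt(
        [{"agents": ["*"], "disallow": ["/admin/"]}],
        sitemaps=["https://example.com/sitemap.xml"],
    ))
```

Writing Allow lines before Disallow lines does not matter to crawlers that pick the longest match, but it keeps the file readable for older parsers that stop at the first matching rule.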

About the Robots Exclusion Protocol

The Robots Exclusion Protocol (REP) began as an informal convention proposed by Martijn Koster in 1994 after his own server was overwhelmed by an early web crawler. For nearly three decades it remained a de facto standard, and every major search engine agreed on the file format and location. In September 2022 the IETF published RFC 9309, which finally standardized the protocol and formalized the syntax for user-agent groups, allow and disallow rules, and path matching.

Robots.txt works on trust — well-behaved crawlers read the file and obey its directives, but it is not an access-control mechanism. A determined scraper or a malicious bot can ignore the file entirely, which is why sensitive material must be protected by authentication or IP restrictions rather than a Disallow rule.

The file must be served as UTF-8 plain text from the root of the host. RFC 9309 requires crawlers to parse at least the first 500 kibibytes, and Google ignores anything beyond that limit, so treat 500 KiB as the practical ceiling. Crawlers typically cache robots.txt for up to 24 hours, so changes can take a day to take effect across all bots.
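A small pre-upload check for the size and encoding points might look like the sketch below. The function name check_robots_bytes and the wording of the messages are made up for illustration; the 500 KiB figure is the one quoted above.

```python
MAX_PARSED_BYTES = 500 * 1024  # 500 KiB, the figure cited from RFC 9309


def check_robots_bytes(data: bytes) -> list[str]:
    """Return a list of problems found in the raw robots.txt bytes."""
    problems = []
    if len(data) > MAX_PARSED_BYTES:
        problems.append("larger than 500 KiB; crawlers may ignore the excess")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    return problems


if __name__ == "__main__":
    print(check_robots_bytes(b"User-agent: *\nDisallow:\n"))
```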

Examples

A minimal file that allows full crawling and points to a sitemap:

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

A file that blocks the admin area and private uploads for every crawler, but still lets Googlebot reach a public PDF inside the blocked folder:

User-agent: *
Disallow: /admin/
Disallow: /uploads/private/

User-agent: Googlebot
Allow: /uploads/private/whitepaper.pdf
Disallow: /uploads/private/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
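One way to sanity-check a file like the second example before uploading it is Python's standard-library urllib.robotparser. The sketch below embeds its own copy of the rules, with Disallow: /admin/ repeated inside the Googlebot group because a named group does not inherit from the * group. Note that this parser applies the first matching rule in a group rather than the longest one, which is why the Allow line is listed first; the helper name can_fetch is just for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /uploads/private/

User-agent: Googlebot
Allow: /uploads/private/whitepaper.pdf
Disallow: /uploads/private/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""


def can_fetch(agent: str, path: str) -> bool:
    """Ask the stdlib parser whether `agent` may fetch `path` on example.com."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "https://example.com" + path)


if __name__ == "__main__":
    for agent, path in [
        ("Googlebot", "/uploads/private/whitepaper.pdf"),
        ("Googlebot", "/uploads/private/notes.pdf"),
        ("Bingbot", "/uploads/private/whitepaper.pdf"),
        ("Googlebot", "/admin/"),
    ]:
        print(agent, path, can_fetch(agent, path))
```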

How Path Matching Actually Works

Paths are matched as case-sensitive prefixes against the part of the URL after the host, so Disallow: /Admin does not block /admin, and a rule of /blog sweeps up /blog/, /blog-archive and /blogroll alike. To block only the folder, write Disallow: /blog/ with the trailing slash. The asterisk * matches any run of characters and $ anchors the end of the URL, so Disallow: /*?sort= blocks sorted duplicates and Allow: /*.css$ re-allows stylesheets, provided it is longer than the Disallow it competes with. When an Allow and a Disallow both match, RFC 9309 gives the win to the rule with the longer path pattern, with ties going to Allow.
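The matching rules described here can be approximated in a short Python sketch. It is a simplification for illustration, not a full RFC 9309 implementation: the names pattern_to_regex and is_allowed are hypothetical, rule length is counted in characters rather than octets, and percent-encoding is not normalized.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern (* and trailing $) into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """rules is a list of (directive, pattern) pairs for one user-agent group."""
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            allow = directive.lower() == "allow"
            length = len(pattern)
            # Longest pattern wins; on a tie, Allow wins.
            if length > best_len or (length == best_len and allow):
                best_len, best_allow = length, allow
    return best_allow


if __name__ == "__main__":
    rules = [("Disallow", "/assets/"), ("Allow", "/assets/*.css$")]
    print(is_allowed(rules, "/assets/site.css"))
    print(is_allowed(rules, "/assets/app.js"))
```

In the usage lines, the Allow pattern is longer than the Disallow it overlaps, which is what lets the stylesheet through while the script stays blocked.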

A Worked Example and Common Mistakes

On WordPress, block only /wp-admin/ while allowing /wp-admin/admin-ajax.php, and leave /wp-content/ crawlable because it holds the assets Google needs to render the page. On a faceted catalog, block the parameter URLs that multiply infinitely (/*?color=, /*?sort=) while clean canonical URLs stay open. Three classic mistakes: a leftover Disallow: / from a staging build silently deindexes a site; a public Disallow advertises the very file it tries to hide; and blocking an already-indexed page freezes the stale listing because Google can no longer fetch the noindex that would drop it.
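The leftover Disallow: / mistake is easy to catch mechanically before a deploy. The following is a minimal sketch of such a guard, assuming the file content is already available as a string; find_sitewide_blocks is a hypothetical helper, and it only looks for a bare / rule, not for wildcard patterns with the same effect.

```python
def find_sitewide_blocks(robots_txt: str) -> list[str]:
    """Return the user-agents whose group contains a bare 'Disallow: /'."""
    blocked = []
    agents = []       # user-agents of the group currently being read
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                agents = []
                in_rules = False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.extend(a for a in agents if a not in blocked)
    return blocked


if __name__ == "__main__":
    staging = "User-agent: *\nDisallow: /\n"
    print(find_sitewide_blocks(staging))
```

A deploy script could refuse to publish to the production host when the returned list contains "*".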

When to Use a Robots.txt File

  • Keep crawl budget focused — Steer bots away from infinite faceted-search URLs, internal search result pages, or archive calendars that waste crawl budget.
  • Block staging or dev environments — Prevent a pre-production mirror from being indexed by adding a site-wide Disallow on that host.
  • Control AI training crawlers — Disallow bots such as GPTBot, ClaudeBot, CCBot, Google-Extended or PerplexityBot if you do not want content used for model training.
  • Reference your sitemap — Advertise sitemap.xml so crawlers discover new URLs quickly even when your internal linking is sparse.
  • Throttle aggressive bots — Use Crawl-delay (honored by Bing and Yandex, ignored by Google) to slow non-critical crawlers on a resource-constrained server.
  • Keep duplicate URLs out — Block tracking-parameter URLs or printer-friendly duplicates that would otherwise dilute your crawl-to-index ratio.

Frequently Asked Questions

What is robots.txt?

Robots.txt is a plain-text file placed at the root of a website that tells compliant web crawlers which paths they should or should not fetch. The format is defined by the Robots Exclusion Protocol, standardized as IETF RFC 9309 in 2022. Each group of rules starts with one or more User-agent lines followed by Allow and Disallow directives, and optional Sitemap lines can appear anywhere in the file.

Where do I place robots.txt?

Place it in the document root of each host so it is reachable at https://yourdomain.com/robots.txt. A file in a subdirectory like /site/robots.txt is ignored. Each subdomain is treated as its own host, so blog.example.com needs its own robots.txt separate from www.example.com. Protocol and port also matter: HTTPS and HTTP are distinct origins for crawling purposes.
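To see which robots.txt governs a given page, keep the scheme, host and port and replace everything else with /robots.txt. A minimal sketch using the standard library (robots_url is a hypothetical helper name):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to page_url."""
    parts = urlsplit(page_url)
    # netloc carries both the host and any explicit port.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("https://blog.example.com/post/1?ref=home"))
    print(robots_url("http://example.com:8080/a/b"))
```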

Does robots.txt block pages from appearing in Google?

No. Robots.txt prevents crawling, not indexing. Google can still index a URL it is not allowed to crawl if it discovers the URL through external links, and will show a bare listing with no snippet. To keep a page out of search results entirely, allow crawling and add a <meta name="robots" content="noindex"> tag or an X-Robots-Tag: noindex HTTP header so Google can read the directive.
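The header variant is useful for non-HTML files such as PDFs, where a meta tag is not possible. As a rough illustration, a bare WSGI application could attach it like this; the app name and the page body are placeholders, and a real site would normally set the header in its web server or framework configuration instead.

```python
def noindex_app(environ, start_response):
    """Minimal WSGI app that serves one page with X-Robots-Tag: noindex."""
    body = b"<!doctype html><title>Private report</title>"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Crawlers must be allowed to fetch this URL to see the header.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Serves on http://localhost:8000/ until interrupted.
    make_server("localhost", 8000, noindex_app).serve_forever()
```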

What is the difference between crawling and indexing?

Crawling is the act of fetching a URL and its resources. Indexing is the act of storing the crawled content in a searchable database so it can be ranked for queries. Robots.txt only controls crawling. A page can be indexed without being crawled (via external links) or crawled without being indexed (via noindex). Confusing the two is the most common robots.txt mistake.

Should I block CSS and JavaScript in robots.txt?

No. Google has stated since 2015 that it renders pages like a modern browser and needs to fetch CSS, JS and image assets to understand layout and mobile-friendliness. Blocking /wp-content/, /static/ or CDN paths can cause Google to see a broken page, which can hurt its rankings. Leave your asset directories crawlable and use noindex on the individual HTML pages you want to hide.

Can I have multiple user-agent rules?

Yes. A robots.txt file can contain any number of user-agent groups. A crawler reads the whole file, picks the single group whose User-agent line most specifically matches its name, and ignores the others, including the * wildcard group if a more specific match exists. That means once you add User-agent: Googlebot, Google will only obey that block and will not inherit rules from the * block, so repeat any shared rules inside the specific group.
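The "one group only" behavior can be pictured with a toy lookup. This sketch assumes the file has already been split into a dictionary from user-agent token to rule lines; select_group and that dictionary shape are hypothetical, and real crawlers may match tokens more loosely than the exact comparison used here.

```python
def select_group(groups: dict, product_token: str) -> list:
    """Return the rules of the one group a crawler would obey."""
    wanted = product_token.lower()
    for token, rules in groups.items():
        if token != "*" and token.lower() == wanted:
            return rules  # a named group replaces the * group entirely
    return groups.get("*", [])  # fall back to the wildcard group


if __name__ == "__main__":
    groups = {
        "*": ["Disallow: /admin/"],
        "Googlebot": ["Disallow: /private/"],
    }
    # Googlebot sees only its own rule, so /admin/ is NOT blocked for it.
    print(select_group(groups, "Googlebot"))
    print(select_group(groups, "Bingbot"))
```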

What does the Allow directive do?

Allow is an exception to a broader Disallow. When the path in Allow is more specific than the path in a matching Disallow, the crawler may fetch that URL. This is the standard way to open a single file or subfolder inside an otherwise blocked area. Not all crawlers honor Allow (the original 1994 spec did not include it), but Googlebot, Bingbot and most modern bots do.

Does Crawl-delay work for Googlebot?

No. Googlebot has never implemented the Crawl-delay directive. Bing and some other crawlers honor it as a minimum number of seconds between fetches. Google retired the crawl-rate setting in Search Console in early 2024, so to slow Googlebot during a temporary overload, return HTTP 503 or 429 responses and Google will back off automatically.

What happens if robots.txt returns an error or is missing?

A missing file (HTTP 404) is treated as "no restrictions": crawlers assume the whole site is open. A persistent server error (HTTP 5xx) is treated as "fully disallowed" by Google while the failure lasts, so a misconfigured server serving 503 on the robots.txt request can halt crawling of the entire site and, if it persists, cost the site its listings. Make sure the file returns a clean 200.
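As a rough model of that behavior, a crawler's decision could be written as a mapping from the robots.txt response status to a crawl policy. The function name and the returned labels are invented for this sketch; treating every 4xx like a 404 is an assumption that follows RFC 9309's "unavailable" category, and individual crawlers differ in the details.

```python
def policy_for_status(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl policy label."""
    if 200 <= status < 300:
        return "parse_rules"   # read the file and obey it
    if 400 <= status < 500:
        return "allow_all"     # no usable file: no restrictions (assumed for all 4xx)
    if 500 <= status < 600:
        return "disallow_all"  # server trouble: treat the site as blocked
    return "other"             # redirects and anything else are not modeled here


if __name__ == "__main__":
    for code in (200, 404, 503):
        print(code, policy_for_status(code))
```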
