Where does robots.txt need to be placed on my site?

Robots.txt must live at the root of your domain and be accessible at https://yourdomain.com/robots.txt. Subdirectory locations like /blog/robots.txt are ignored by all major crawlers. Each subdomain requires its own file: blog.example.com/robots.txt is separate from example.com/robots.txt.

What is the difference between robots.txt and noindex?

Robots.txt controls crawling - it tells bots not to fetch a URL. The noindex directive controls indexing - it tells Google not to show the page in search results. The critical trap is that if you Disallow a URL, Google never fetches it and never sees the noindex tag, so the page can still appear in search results (URL only, no snippet) if other sites link to it. To truly remove a page from the index, leave it crawlable and add a noindex meta tag or X-Robots-Tag header instead.

Does Google still follow robots.txt in 2026?

Yes. Google, Bing, Yandex, and most major crawlers comply with RFC 9309, which standardized the robots exclusion protocol in September 2022. Compliance is voluntary, however - robots.txt is a request, not a firewall. Malicious scrapers ignore it entirely, and legitimate AI crawlers vary in how strictly they honor it.

How do I block ChatGPT from crawling my site?

Add a Disallow: / rule under User-agent: GPTBot to block OpenAI's training crawler. If you also want to stop your content from appearing in ChatGPT's live answers, add separate groups blocking OAI-SearchBot and ChatGPT-User. Most publishers block GPTBot to prevent training-data use but allow OAI-SearchBot and ChatGPT-User so their content can still be cited in ChatGPT responses.

Can robots.txt block AI bots from training on my content?

Robots.txt can instruct well-behaved AI training crawlers like GPTBot, ClaudeBot, and CCBot to stay away, and most major AI labs publicly commit to honoring these rules. However, compliance is voluntary and there is no technical enforcement - robots.txt does not block HTTP requests. For stronger protection, layer WAF rules or Cloudflare Bot Management on top, and use the Content-Signal directive (search=yes, ai-input=yes, ai-train=no) to signal your policy in a machine-readable format.

Robots.txt Explained: The Complete SEO Guide for 2026

Robots.txt is the smallest file on your site with the biggest potential to break it. A single misplaced slash can deindex an entire domain. A missing line can hand your content to every AI training crawler on the internet. This guide walks through the modern spec (RFC 9309), the 2026 AI crawler landscape, and the rules that actually move the needle, without the filler that pads most SEO posts.

What Is robots.txt?

Robots.txt is a plain-text file that lives at the root of your domain, always at /robots.txt. It tells well-behaved web crawlers which parts of your site they may fetch and which they should leave alone. Nothing more.
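As a small illustration of the "always at the root" rule, here is a minimal Python sketch that derives the governing robots.txt URL from any page URL. The function name `robots_txt_url` is made up for this example; it relies only on the standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (subdomain and port included);
    # drop the path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://blog.example.com/2026/post?utm_source=x"))
print(robots_txt_url("https://example.com/blog/robots.txt"))
```

Note that the second call resolves to the root file: a robots.txt sitting in a subdirectory is never the one a crawler consults.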

The protocol was informal for decades until the IETF standardized it in September 2022 as RFC 9309. Google, Bing, Yandex, OpenAI, Anthropic, Apple, and Perplexity all advertise compliance. Compliance is voluntary, though. Robots.txt is a request, not a firewall.

Two things robots.txt does not do:

It does not control indexing. A page you block from crawling can still appear in search results if other sites link to it (Google will show the URL with no snippet).
It does not provide security. Anyone, including malicious scrapers, can read your robots.txt. Listing /admin/ there just tells attackers where to look.

If you want to skip the syntax and ship something correct in two minutes, our Robots.txt Generator handles the boilerplate, AI crawler presets, and Cloudflare Content-Signal header in one click.

How Search Engines Use It

When a crawler first visits your site, it fetches /robots.txt before anything else. Per RFC 9309 the file must be:

Served over the same protocol and host as your site
Returned with Content-Type: text/plain and UTF-8 encoding
Kept under 500 KiB (RFC 9309 only requires crawlers to parse the first 500 KiB, and Google ignores anything beyond that limit)

The crawler then looks for the most specific User-agent group that matches its name. User-agent matching is case-insensitive (Googlebot equals googlebot), but path matching is case-sensitive (/Page is not the same as /page). This trips up teams running mixed-case URL schemes on Windows-origin servers.
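Group selection can be sketched in a few lines of Python. This is a simplified model, not any crawler's actual implementation: the helper `select_group` and the dictionary layout are assumptions for illustration, and real parsers handle more edge cases (for example, merging duplicate groups).

```python
def select_group(groups: dict[str, list[str]], crawler_name: str) -> list[str]:
    """Return the rules of the most specific group matching crawler_name.

    groups maps a User-agent token to that group's rule lines.
    Falls back to the '*' group, or an empty list if there is none.
    """
    name = crawler_name.lower()  # user-agent matching is case-insensitive
    best_token, best_rules = "", groups.get("*", [])
    for token, rules in groups.items():
        candidate = token.lower()
        # A longer matching token counts as more specific.
        if candidate != "*" and candidate in name and len(candidate) > len(best_token):
            best_token, best_rules = candidate, rules
    return best_rules


groups = {
    "*": ["Disallow: /private/"],
    "Googlebot": ["Disallow: /tmp/"],
}
print(select_group(groups, "googlebot"))  # matches the Googlebot group
print(select_group(groups, "Bingbot"))    # falls back to '*'
```

The key point is that a crawler obeys one group only: once a named group matches, the rules under `*` no longer apply to it.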

Within a matching group, the crawler applies the longest-match-wins rule from RFC 9309 section 2.2.2. The rule with the most octets matching the requested URL prevails. When two rules tie in length, Allow beats Disallow. So:

text

User-agent: *
Disallow: /reports/
Allow: /reports/public/

A request for /reports/public/q4.pdf is allowed (longer match) even though /reports/ is disallowed.
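The same decision can be expressed as a short Python sketch. It is a simplified model under stated assumptions: rules are plain path prefixes (no wildcards), the function name `is_allowed` is invented for this example, and length is counted in characters, which equals octets for ASCII paths.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match-wins to a list of ("allow" | "disallow", prefix) rules."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        is_allow = kind == "allow"
        # Longer match wins; on a tie in length, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


rules = [("disallow", "/reports/"), ("allow", "/reports/public/")]
print(is_allowed(rules, "/reports/public/q4.pdf"))  # True
print(is_allowed(rules, "/reports/internal.pdf"))   # False
```

Swapping the order of the two rules gives the same answers, which is the practical meaning of "length beats order."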

Figure: Flowchart of how Googlebot evaluates Allow and Disallow rules. It reads rules top-to-bottom within a group, but the longest match wins: length beats order.

If your server returns 4xx for /robots.txt, Google treats the site as fully crawlable. A persistent 5xx (for more than 30 days) is also treated as fully crawlable, but in the short term Google pauses crawling. Always serve a valid file, even if it is just User-agent: * and a sitemap line.

Allow vs Disallow Rules

Every robots.txt is a series of groups. Each group starts with one or more User-agent lines, followed by the Allow and Disallow rules that apply to those crawlers.

Wildcards expand your options:

* matches zero or more of any character
$ anchors the match to the end of the URL
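To make the two wildcards concrete, here is a minimal Python sketch that translates a robots.txt path pattern into a regular expression. The helper names `pattern_to_regex` and `matches` are assumptions for this example, and percent-encoding normalization is left out.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    # re.match anchors at the start, mirroring robots.txt prefix matching.
    return pattern_to_regex(pattern).match(url_path) is not None


print(matches("/*?sort=", "/shoes?sort=price"))  # True
print(matches("/*.pdf$", "/docs/guide.pdf"))     # True
print(matches("/*.pdf$", "/docs/guide.pdf?v=2")) # False
```

Without the trailing `$`, a pattern is a prefix match, so `/*.pdf` would also catch `/docs/guide.pdf?v=2`.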

Three patterns cover most needs:

text

# Block a single file
User-agent: *
Disallow: /internal-memo.pdf

# Block an entire directory
User-agent: *
Disallow: /staging/

# Block a directory but allow one file inside it
User-agent: *
Disallow: /reports/
Allow: /reports/quarterly-summary.html

Sitemap is a top-level directive, not tied to any group. You can list multiple sitemaps, each on its own line, with absolute URLs:

text

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Common Examples

The five patterns below cover roughly 90 percent of real-world robots.txt files.

Allow everything (the default if you have no file):

text

User-agent: *
Disallow:

Block everything (use only on staging, never production):

text

User-agent: *
Disallow: /

Block one bot, allow the rest:

text

User-agent: SemrushBot
Disallow: /

User-agent: *
Disallow:

Block search results and faceted-nav noise:

text

User-agent: *
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=

A real production example. Stripe runs one of the cleanest robots.txt files on the web:

text (stripe.com/robots.txt)

User-agent: ia_archiver
Disallow: /

User-agent: *
Allow: /docs
Disallow: /handoff
Disallow: /sources/test_*

User-agent: rogerbot
Disallow: /

Sitemap: https://stripe.com/sitemap.xml

Three groups, one sitemap, no over-engineering. Notice the explicit Allow: /docs even though nothing above it disallows /docs. That is defensive clarity, useful when the file grows.

WordPress Setup

WordPress generates a virtual robots.txt by default that is usually fine. Most “WordPress SEO booster” rule sets you find online actively hurt rankings. The minimal, Yoast-endorsed version is:

text

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap_index.xml

That is it. Do not block:

/wp-includes/ (contains scripts Google needs to render pages)
/wp-content/plugins/ (same reason)
CSS or JS files anywhere
/?p=, /tag/, /category/, /author/ (Google explicitly warns against this)

Ecommerce Best Practices

Ecommerce sites have one job for robots.txt: keep crawlers focused on revenue pages, away from infinite parameter combinations and private user flows.

text

User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /wishlist
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?utm_

User-agent: AdsBot-Google
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml

The dedicated AdsBot-Google group matters: Google Ads will refuse to serve shopping ads for products it cannot crawl, and AdsBot-Google does not obey the wildcard group. Always allow it explicitly. Shopify's production robots.txt does exactly this for AdIdxBot, Pinterestbot, and AdsBot-Google while blocking /cart, /checkout, /account, and a long list of campaign-parameter URLs.

Blocking AI Crawlers (2026)

This is where most older guides fall apart. The 2026 model is block training, allow retrieval: stop AI vendors from using your content to train their next foundation model, but keep your site visible inside ChatGPT, Claude, Gemini, and Perplexity answers (because those answers cite sources, and citations send traffic).

Figure: AI crawlers (GPTBot, ClaudeBot, PerplexityBot) filtered by a robots.txt gateway. Training crawlers (block) vs. retrieval bots (allow): the distinction that most guides miss.

The user-agents you need to know:

Operator	Training crawler (often blocked)	Retrieval / search bot (usually allowed)
OpenAI	`GPTBot`	`OAI-SearchBot`, `ChatGPT-User`
Anthropic	`ClaudeBot`, `anthropic-ai`, `claude-web` (legacy)	`Claude-User`, `Claude-SearchBot`
Google (Gemini/Vertex)	`Google-Extended` (a token, not a bot)	`Googlebot` (unaffected)
Apple	`Applebot-Extended`	`Applebot`
Meta	`Meta-ExternalAgent`, `FacebookBot`	-
Perplexity	-	`PerplexityBot`, `Perplexity-User`
ByteDance	`Bytespider`	-
Common Crawl	`CCBot`	-
Cohere	`cohere-ai`	-
Amazon	`Amazonbot`	-

The Anthropic three-bot split

Anthropic, in particular, deserves attention because it ships three distinct user-agents and the difference matters:

ClaudeBot crawls the open web to gather training data. Block it if you do not want your content in future Claude models.
Claude-User fetches a page on demand when a user asks Claude a question that requires live web access. This is interactive, not training. Most publishers allow it.
Claude-SearchBot powers Claude's search-result citations. Allowing it is what gets you cited in Claude answers.

A blanket Disallow: / for all three removes you from Claude entirely. The granular pattern most publishers want is:

text

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

The same logic applies to OpenAI: block GPTBot, allow OAI-SearchBot and ChatGPT-User. Apply this framework consistently across every vendor.
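If you maintain this policy for many vendors, generating the groups from two lists keeps it consistent. The following Python sketch is one possible approach; the function `ai_groups` and the list names are invented here, and the bot names are taken from the table above.

```python
# User-agent tokens taken from the table in this guide.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Bytespider"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-User",
                  "Claude-SearchBot", "PerplexityBot"]


def ai_groups(block: list[str], allow: list[str]) -> str:
    """Build one robots.txt group per bot: Disallow for block, Allow for allow."""
    lines: list[str] = []
    for bot in block:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    for bot in allow:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    # Groups are separated by blank lines; end with a single newline.
    return "\n".join(lines).rstrip() + "\n"


print(ai_groups(TRAINING_BOTS, RETRIEVAL_BOTS))
```

Keeping the two lists in version control makes it easy to review a change when a vendor adds or renames a crawler.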

Cloudflare Content-Signal header

In September 2025, Cloudflare proposed a machine-readable extension called the Content Signals Policy. It adds a single line to robots.txt with three keys:

text

Content-Signal: search=yes, ai-input=yes, ai-train=no

The three signals mean:

search - appearing in traditional search engine results
ai-input - being used as live context in an AI answer (with citation)
ai-train - being included in training datasets for foundation models

Values are yes or no; omitting a signal means neutral. Cloudflare's managed robots.txt customers default to search=yes, ai-train=no. Google, OpenAI, and Anthropic have all publicly acknowledged the directive. Adoption is still early and the signal is advisory, not enforced. Pair it with WAF rules if you need teeth.
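For auditing your own files, the line is easy to read programmatically. This Python sketch parses a Content-Signal line as described above; `parse_content_signal` is a made-up helper, not part of any Cloudflare tooling, and it silently skips values other than yes or no.

```python
def parse_content_signal(line: str) -> dict[str, bool]:
    """Parse 'Content-Signal: key=yes, key=no' into {key: bool}.

    A key that is absent from the result is neutral (no preference stated).
    """
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError("not a Content-Signal line")
    signals: dict[str, bool] = {}
    for item in value.split(","):
        key, sep, val = item.partition("=")
        val = val.strip().lower()
        if sep and val in ("yes", "no"):
            signals[key.strip().lower()] = val == "yes"
    return signals


print(parse_content_signal("Content-Signal: search=yes, ai-input=yes, ai-train=no"))
```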

A complete 2026-ready robots.txt for a publisher who wants AI citations but no training:

text (robots.txt)

# Allow everyone by default
User-agent: *
Allow: /
Disallow: /admin/

# Block training-only AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval/search bots explicitly
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Machine-readable policy
Content-Signal: search=yes, ai-input=yes, ai-train=no

Sitemap: https://example.com/sitemap.xml

Mistakes That Hurt SEO

Six errors come up over and over in real audits.

Blocking CSS or JS. Google needs them to render the page. Blocking them kills mobile usability scores.
Pushing a staging Disallow: / to production. Set up a deploy hook that overwrites robots.txt based on environment.
Blocking /sitemap.xml. Yes, people do this. The sitemap belongs in robots.txt as a directive, not in a disallow.
Using Noindex: in robots.txt. Google deprecated the Noindex: directive on 1 September 2019. It is silently ignored today. Use a <meta name="robots" content="noindex"> tag or the X-Robots-Tag HTTP header instead.
Case-sensitivity slip-ups. Disallow: /Admin does not block /admin. Pick one casing and stick to it.
Trailing-slash confusion. Disallow: /reports blocks /reports, /reports/, and /reports.pdf, because rules are prefix matches. Disallow: /reports/ only blocks paths inside that directory. Use $ if you mean “exact match only.”
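The deploy hook mentioned for the staging mistake can be very small. Below is a minimal Python sketch, assuming two environments named "production" and "staging" and a hypothetical webroot path supplied by your deploy script; adapt the templates to your own rules.

```python
from pathlib import Path

# Hypothetical templates; replace with your real rules.
TEMPLATES = {
    "production": "User-agent: *\nDisallow:\n\nSitemap: https://example.com/sitemap.xml\n",
    "staging": "User-agent: *\nDisallow: /\n",
}


def robots_for(environment: str) -> str:
    """Return the robots.txt body for an environment, failing loudly on typos."""
    try:
        return TEMPLATES[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


def write_robots(webroot: str, environment: str) -> Path:
    """Overwrite <webroot>/robots.txt with the template for this environment."""
    target = Path(webroot) / "robots.txt"
    target.write_text(robots_for(environment), encoding="utf-8")
    return target
```

Raising on an unknown environment name is a deliberate choice: a misspelled variable stops the deploy instead of silently shipping the wrong file.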

robots.txt vs noindex

This is the single most misunderstood point in technical SEO.

Robots.txt controls crawling. It tells bots not to fetch a URL.
noindex controls indexing. It tells Google not to show the page in search results.

The trap: if you Disallow a URL, Google never fetches it, which means Google never sees a noindex tag on that page. If anyone links to that URL externally, Google may still index it (URL only, no snippet, no title). You will see it in Search Console as “Indexed, though blocked by robots.txt.”

To truly remove a page from the index, do the opposite of what feels natural: leave it crawlable and add a noindex meta tag or X-Robots-Tag: noindex header. Once Google has reprocessed it and dropped it from the index, you can then add the Disallow rule.
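Before adding the Disallow rule, it helps to confirm the page really sends a noindex signal. The Python sketch below checks both places a crawler would look. It is a rough check under stated assumptions: `has_noindex` is an invented helper, and the regular expression only handles a meta tag whose name attribute comes before content.

```python
import re

# Matches <meta name="robots" content="...noindex..."> (name before content only).
META_NOINDEX = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]*content=["\'][^"\']*noindex',
    re.IGNORECASE,
)


def has_noindex(headers: dict[str, str], html: str) -> bool:
    """Return True if the response carries noindex via header or meta tag."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    return META_NOINDEX.search(html) is not None


print(has_noindex({"X-Robots-Tag": "noindex"}, ""))
print(has_noindex({}, '<meta name="robots" content="noindex, follow">'))
print(has_noindex({}, "<p>No directive here</p>"))
```

For production use, a real HTML parser would be more robust than a regular expression.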

Generate robots.txt Online

Once you understand the rules, maintaining robots.txt is mostly a copy-paste exercise. That is what our tool exists to remove. The iToolVerse Robots.txt Generator gives you:

One-click presets for every major search and AI crawler (including the Anthropic three-bot split)
A Cloudflare Content-Signal toggle with search, ai-input, and ai-train keys
Sitemap and host directive fields
Instant copy, download, and a validator that flags syntax errors before you ship

Figure: Google Search Console robots.txt report (Settings → robots.txt), showing parse status and last fetch time.

After you upload the file, validate it inside Google Search Console under Settings → robots.txt. The report shows the last-fetch time, parse status, file size, and lets you test any URL against the live rules. It is the fastest sanity check available, and it surfaces problems the moment Google notices them.
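For a quick local check before uploading, Python's standard library ships `urllib.robotparser`. The sketch below feeds it a small sample file. Treat it as a rough sanity check only: the standard-library parser may not mirror Google's wildcard and longest-match handling, so Search Console remains the authority for Googlebot. The sample rules and the helper `build_parser` are assumptions for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))       # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/users"))  # False
```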
