The SEO Guide to Robots.txt: Rules, Risks, and Best Practice
The robots.txt file is a staple of technical SEO and should be included as standard in any comprehensive technical audit. It’s a way of communicating crawl instructions to search engine bots and other automated user-agents, controlling which areas of a website can and can’t be accessed.
What is Robots.txt?
Robots.txt is a plain text file, typically located at the root of a domain via the /robots.txt URI, for example: salt.agency/robots.txt. When accessed, it acts as the first point of reference for compliant crawlers before they begin crawling a website.
A robots.txt file can either set rules for individual user-agents or apply blanket instructions to all crawlers. These rules can allow or restrict crawling of a folder, URL pattern, or an entire website. For example, this directive prevents all compliant crawlers from accessing any part of the site:
User-agent: *
Disallow: /
Let’s take a closer look at robots.txt files. We’ll explore what they are, how to structure valid directives, and how to use advanced pattern matching. We’ll also advise on how to approach robots.txt strategically in the more complex modern SEO environment that now includes AI crawlers and large-scale data collection bots, alongside traditional search engines.
Robots.txt formatting
The robots.txt file has three officially supported directives under the current standard:
- User-agent
- Disallow
- Allow
Other directives may appear in robots.txt files, such as crawl-delay, but these are not part of the official specification and are handled inconsistently by different crawlers.
The syntax itself is straightforward. Each directive consists of a field name, followed by a colon, and then the attribute you want the rule to apply to. For user-agent, this is the name of the crawler, such as Googlebot or Bingbot. For Disallow and Allow, this is a URI path or pattern, such as /wp-admin/.
One important detail to be aware of is that field names are case-insensitive, while the path attributes they match against are case-sensitive. If you want to block /wp-admin/, specifying /WP-Admin/ will not work.
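This behaviour can be verified with Python's standard-library robots.txt parser. The domain and file paths below are placeholders for illustration:

```python
# Demonstrating case-sensitive path matching with Python's
# built-in robots.txt parser (urllib.robotparser).
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wp-admin/",
])

# The lowercase path is matched by the rule and blocked...
print(rp.can_fetch("*", "https://example.com/wp-admin/options.php"))  # False
# ...but a differently cased path is not matched, so it stays crawlable.
print(rp.can_fetch("*", "https://example.com/WP-Admin/options.php"))  # True
```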
User-agent
User-agent is a directive that states which crawler the rules apply to. The syntax it follows is:
User-agent: (attribute)
One option is to set the attribute as a wildcard, applying the rules to all user-agents. This is done by using the * symbol, as below:
User-agent: *
Alternatively, it can target an individual crawler by using the crawler’s name as the attribute. Common examples include:
- User-agent: Googlebot (Google's search user-agent)
- User-agent: Bingbot (Bing's search user-agent)
- User-agent: DuckDuckBot (DuckDuckGo's search user-agent)
- User-agent: YandexBot (Yandex's search user-agent)
Google uses multiple user-agents for different services, including Googlebot, Googlebot-Image, Googlebot-News, and others, and has published a full list.
Disallow
Disallow is the directive used to prevent user-agents from crawling specific parts of a website. It follows the same syntax pattern as user-agent. A common mistake, especially on staging environments, is leaving disallow without a value:
User-agent: *
Disallow:
This does not block anything. To block the entire site from being crawled, you must give the directive a value of /, as below:
Disallow: /
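The difference between an empty Disallow and Disallow: / can be checked with Python's standard-library parser:

```python
# Contrasting an empty Disallow (blocks nothing) with "Disallow: /"
# (blocks everything), using Python's stdlib urllib.robotparser.
from urllib.robotparser import RobotFileParser

open_rules = RobotFileParser()
open_rules.parse(["User-agent: *", "Disallow:"])

closed_rules = RobotFileParser()
closed_rules.parse(["User-agent: *", "Disallow: /"])

# An empty Disallow value permits crawling of every URL.
print(open_rules.can_fetch("*", "/any/page"))    # True
# "Disallow: /" blocks crawling of the whole site.
print(closed_rules.can_fetch("*", "/any/page"))  # False
```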
To block specific areas of a site, you might use some of the following commands:
User-agent: *
Disallow: /wp-admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /user-account/
These rules prevent compliant crawlers from accessing those directories while still allowing the rest of the site to be crawled normally.
Allow
Allow, in contrast, can be used to explicitly permit crawling of specific URIs, even when their parent directory is blocked through disallow. Allow is particularly useful when you want to block an entire folder but still allow access to a small number of important URLs within it.
For example:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
In this case, all of /wp-admin/ is blocked from crawling, except for admin-ajax.php, which is required for many WordPress front-end features to function correctly.
Allow can also be used to prioritise more specific paths over broader disallow rules. When multiple rules apply, search engines like Google will follow the most specific matching directive.
It’s worth noting that allow is not universally supported in the same way by all crawlers, but it is respected by major search engines such as Google and Bing. As a result, allow should be used sparingly and with clear intent, primarily to support essential resources or critical URLs rather than as a general access control mechanism.
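Google's longest-match precedence can be sketched in a few lines of Python. This is an illustrative helper, not Google's actual implementation, and it handles only plain prefix rules (no wildcards):

```python
# A minimal sketch of Google-style precedence: when several rules
# match a path, the longest (most specific) rule wins, and on a
# length tie the Allow rule takes priority.

def is_allowed(path, rules):
    """rules: list of (directive, value) pairs, e.g. ("Allow", "/wp-admin/admin-ajax.php")."""
    # Collect every rule whose value is a prefix of the path.
    matches = [(len(value), directive.lower() == "allow")
               for directive, value in rules
               if path.startswith(value)]
    if not matches:
        return True  # no matching rule: crawling is allowed by default
    # max() picks the longest match; on a length tie, tuple ordering
    # puts (n, True) above (n, False), favouring Allow.
    return max(matches)[1]

rules = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed("/wp-admin/options.php", rules))     # False: only Disallow matches
print(is_allowed("/wp-admin/admin-ajax.php", rules))  # True: the longer Allow rule wins
```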
Comments in robots.txt
Robots.txt files are living documents and are often worked on by multiple teams, particularly in enterprise platforms. Comments are used to provide additional context and reduce the risk of conflicting changes.
Comments are added by starting a line with a # symbol.
Here’s a real-world example from Google’s own robots.txt file:
# Certain social media sites are whitelisted to allow crawlers to access page markup when links to google.com/imgres* are shared.
User-agent: facebookexternalhit
User-agent: Twitterbot
Allow: /imgres
This also demonstrates a practical use of Allow as an exception within a broader Disallow rule. All user-agents have been instructed not to crawl the /imgres path, but special exceptions have been made for Facebook and Twitter.
Advanced pattern matching
As well as being able to block whole folders and exact URI paths, the robots.txt file can also be used with pattern matching for more complex requirements, using the * wildcard symbol.
For example:
# Prevent all bots crawling the images subfolder
User-agent: *
Disallow: /images/
This rule instructs all compliant crawlers to avoid crawling any URL that begins with /images/, effectively blocking the entire images directory and all files within it from being accessed.
Or:
# Prevent all bots from crawling blog category pagination
User-agent: *
Disallow: /blog/*/page/
This blocks URLs such as /blog/category-name/page/3 without needing to list each category individually.
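Google-style wildcard matching can be approximated by translating a rule into a regular expression. The helper below is a sketch under that assumption, not an official matcher:

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule with * and $ into a regex.

    Approximates Google-style matching: * matches any sequence of
    characters, $ anchors the end of the URL, and everything else
    is treated as a literal prefix.
    """
    pattern = "".join(
        ".*" if ch == "*" else re.escape(ch)
        for ch in rule.rstrip("$")
    )
    if rule.endswith("$"):
        return re.compile("^" + pattern + "$")
    return re.compile("^" + pattern)  # prefix match: anything may follow

blog_pagination = rule_to_regex("/blog/*/page/")
print(bool(blog_pagination.match("/blog/category-name/page/3")))  # True: pagination blocked
print(bool(blog_pagination.match("/blog/category-name/")))        # False: category page unaffected
```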
Wildcard examples
Wildcards allow you to apply crawl rules at scale by matching URL patterns, rather than having to list individual paths or parameters one by one. Let’s look at a few examples.
Blocking file types
This approach is sometimes used to stop crawlers accessing specific resource types.
User-agent: *
Disallow: *.js$
Disallow: *.css$
Disallow: *.json$
This approach should be used with extreme caution, as blocking CSS or JavaScript can prevent Google from rendering pages properly and lead to indexing or ranking issues.
Blocking all parameter URLs
Blocking all parameterised URLs can help reduce crawl waste on sites where filters and sorting options generate large volumes of low-value URLs.
User-agent: *
Disallow: /*?
This can be useful for sites with heavy faceted navigation, particularly in ecommerce, where parameter handling is not yet fully controlled.
Blocking specific parameters
Targeting individual parameters gives you more granular control and avoids the risks associated with blocking all parameterised URLs.
User-agent: *
Disallow: /*prefn*
Disallow: /*prefv*
Disallow: /*pmin*
Disallow: /*pmax*
This allows you to target known parameters rather than blocking all parameterised URLs.
Blocking internal search results
Internal search result pages are often considered low-quality or duplicative from a search engine perspective and can safely be excluded in many cases.
User-agent: *
Disallow: /search?q=*
This can be useful depending on your platform, but is not advisable if you’re using the internal search function to create landing pages.
Crawl-delay functionality
Crawl-delay is sometimes used to suggest how long a crawler should wait between successive requests, with the intention of reducing server load from aggressive bot activity, for example:
User-agent: *
Crawl-delay: 10
However, crawl-delay is not an official standard and is interpreted differently by different search engines. Google and Baidu, for instance, ignore crawl-delay entirely.
If your website is using crawl-delay to manage excessive bot activity, be aware that it can limit how much of your site is crawled, therefore slowing down new content discovery and negatively affecting organic performance.
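Crawlers that do respect the directive read it per user-agent. Python's standard-library parser happens to understand crawl-delay even though it is non-standard, which makes it easy to inspect:

```python
# Reading a Crawl-delay hint with Python's built-in robots.txt
# parser. Note that major engines such as Google ignore this
# directive entirely; only some crawlers honour it.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

print(rp.crawl_delay("*"))  # 10
```

A polite crawler honouring this hint would sleep that many seconds between successive requests.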
Pros and cons of blocking crawlers
Choosing to block certain crawlers in robots.txt comes with a range of pros and cons. Some of the most important advantages and drawbacks to be aware of are listed below.
Pros
- Reduces unnecessary crawl activity and server load.
- Helps prevent crawl budget being wasted on low-value or infinite URLs.
- Protects sensitive, proprietary, or non-public sections of a site.
- Limits exposure of content to AI training pipelines when compliant bots are blocked.
Cons
- Overly restrictive rules can prevent important pages from being discovered or refreshed.
- Blocking rendering resources can negatively affect how search engines understand pages.
- Some bots may ignore robots.txt entirely.
- Blocking large data sources like Common Crawl can reduce secondary visibility across platforms that reuse that data.
Blocking crawlers in robots.txt should be a carefully considered decision. For most sites, the biggest risk isn’t under-blocking, but rather accidental over-blocking caused by poorly scoped wildcard rules or legacy directives that are no longer relevant.
AI crawlers and training data sources
Alongside traditional search engines, many websites are now crawled by AI-focused user-agents that collect content for model training, inference, or dataset creation. These crawlers operate separately from search indexing bots and are often used to power and train large language models and AI-driven answer engines.
Common AI and data collection crawlers include:
- GPTBot
- CCBot (used by Common Crawl)
- ClaudeBot (used by Anthropic)
In many cases, it’ll be beneficial to allow AI crawlers to access your content. As people increasingly search using LLMs, it’s important that your content is visible either as part of answers or in AI overviews. And this requires your content to be visible to crawlers.
However, there are some types of content and data you might not want AI bots to crawl. If your goal is to prevent your content from being accessed or reused by these systems, robots.txt can be used to explicitly disallow them:
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
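Because these are per-agent groups, blocking an AI crawler leaves other user-agents unaffected. This can be sanity-checked with Python's standard-library parser (the /article path is a placeholder):

```python
# Verifying that a per-agent block applies only to the named
# crawler, using Python's stdlib urllib.robotparser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

# GPTBot is blocked from the whole site...
print(rp.can_fetch("GPTBot", "/article"))     # False
# ...while Googlebot, with no matching group, crawls as normal.
print(rp.can_fetch("Googlebot", "/article"))  # True
```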
Blocking Common Crawl is particularly significant, as it underpins a wide range of downstream datasets used across search, research, and AI products. Content included in Common Crawl can be reused, referenced, or incorporated into systems far beyond a single platform.
However, it’s important to understand that robots.txt is an advisory mechanism. That means only compliant crawlers will respect it, and some AI systems may source data indirectly from previously collected datasets.
Considerations before allowing AI crawlers
Here are a few key considerations to take into account before allowing AI bots to crawl your site:
- Content quality and consistency – AI systems can only learn from content that’s already available. If parts of your messaging are outdated or unclear, that may be carried over when your brand is mentioned in AI-generated search responses.
- Content sensitivity – ensure any sensitive content such as proprietary tools or confidential information will remain inaccessible. Robots.txt shouldn’t be the only layer of protection against this, but it plays a role in signalling to crawlers that you don’t want them to access these materials.
- Crawl accessibility – your site’s key pages must be technically accessible, with strong internal linking and clear structures, or AI crawlers may not interpret them accurately.
In other words, allowing AI crawlers has benefits, but only for websites that are structurally and strategically prepared for it.
Which crawlers should you block or allow?
Rather than treating all crawlers as a single group, it’s more effective to think about them in categories and define access for each of these separately.
Search engine bots
Search engine bots such as Googlebot and Bingbot should almost always be allowed, as they’re crucial for organic visibility and indexing.
AI training and large-scale data collection bots
Some organisations, such as media publishers concerned about copyright, content reuse, or subscription monetisation, might choose to block AI training and large-scale data collection bots such as GPTBot or CCBot. In those cases, restricting access is a decision which aligns with their commercial models, built on gated access and protection of intellectual property.
For most commercial brands, however, blocking these crawlers can reduce long-term visibility in AI-powered search. This is because some AI-driven platforms, search features, and third-party tools rely on shared datasets rather than crawling the web directly.
AI assistants and inference-focused crawlers
AI assistants, such as Google Gemini or Perplexity AI, are becoming increasingly capable of crawling the web directly to find their answers. Allowing these to access your site can increase the likelihood that your brand, products, or thought leadership are referenced in their responses and summaries. As AI-assisted search grows in popularity, being included in these responses could have real long-term benefits for your brand.
Unverified, aggressive, or low-value crawlers
These are often the safest candidates for blocking, particularly if they create excessive server load or have no clear benefit to discovery or traffic.
As with all robots.txt rules, these decisions should be reviewed periodically. The search landscape is constantly evolving, and what works today may not be the best approach in future.
Common mistakes and misconceptions
There are several common misconceptions around what robots.txt rules actually do. Here are a couple of the most commonly made mistakes, and how to avoid them.
Example 1: Wildcard rules blocking unintended URLs
This example highlights how wildcard usage can unintentionally broaden the scope of a disallow rule beyond what was originally intended.
Disallow: /widgets*
At first glance, this looks like it will only block the /widgets directory. In reality, the wildcard causes the rule to match any URL that starts with /widgets, regardless of what follows. That includes not only the intended directory and its subfolders, but also unrelated URLs that happen to share the same prefix.
As a result, this rule would block:
- /widgets and /widgets/blue/, which may be intentional
- /widgets.html and /widgets.php, which may be legitimate standalone pages
- /widgets-and-thingamabobs, which is a completely different URL path
If the intent is to block only the /widgets directory and everything beneath it, the safer and more precise rule is:
Disallow: /widgets/
If, however, the goal is to block only the directory itself while allowing deeper paths such as /widgets/blue/ to be crawled, you can use the $ end-of-URL modifier:
Disallow: /widgets/$
This ensures the rule applies only when the URL ends exactly at /widgets/, avoiding unintended collateral blocking.
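The effect of the $ modifier can be illustrated with a simple regex translation of the rule (a sketch of Google-style matching, not an official tool):

```python
# Checking the end-of-URL modifier: "Disallow: /widgets/$" should
# match only the directory URL itself, not deeper paths.
import re

rule = re.compile("^" + re.escape("/widgets/") + "$")

print(bool(rule.match("/widgets/")))       # True: the directory itself is blocked
print(bool(rule.match("/widgets/blue/")))  # False: deeper paths remain crawlable
```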
Example 2: Case sensitivity causing incomplete blocking
This example demonstrates how case sensitivity in URL attributes can lead to incomplete or inconsistent blocking.
Disallow: *.jpg
This rule blocks any URL that ends in .jpg using lowercase characters, which will successfully prevent crawling of common image URLs such as:
- /image.jpg
- /wp-content/uploads/image.jpg
However, robots.txt matching is case-sensitive for attributes, meaning the rule does not apply to URLs that use uppercase file extensions, such as:
- /image.JPG
If a site serves images using mixed-case file extensions, this can result in partial blocking, where some images are excluded from crawling while others remain accessible.
The robots.txt report in Google Search Console can be used to validate rules in isolation, but you should also consider whether rules may have unintended impacts on other areas of the site.
Robots.txt strategy for SEOs
A strong robots.txt strategy should guide crawlers towards the pages that matter most. At a strategic level, robots.txt should be used to:
- Prioritise crawl access to indexable, high-value content.
- Eliminate crawl traps such as infinite pagination, faceted navigation, and internal search results.
- Preserve crawl budget for large or complex sites.
- Avoid blocking CSS, JavaScript, and other resources required for rendering.
- Explicitly manage non-search crawlers, including AI and data collection bots, where appropriate.
Robots.txt should never operate in isolation. It works best alongside site-wide SEO best practices such as noindex directives, canonical tags, parameter handling, logical internal linking, and clean URL architecture.
From an operational standpoint, robots.txt should be reviewed regularly, and after website updates such as migrations, platform changes, international rollouts, and major feature releases. Search algorithms are constantly changing, and what worked for a site six months ago may not be the right approach today.
Let SALT help with your SEO strategy
Robots.txt is one of the simplest technical SEO files on the surface, but it carries risk. A single misplaced wildcard can quietly remove large sections of a site from search visibility, while a well-considered strategy can significantly improve crawl efficiency and long-term SEO performance.
If you’re auditing robots.txt as part of a wider technical SEO review, or rethinking your approach in light of AI crawlers and evolving search behaviour, SALT can help.
If you’re unsure whether your current robots.txt setup is helping or hindering your site’s performance, get in touch with our expert team today.