The robots.txt file is commonplace in any standard technical SEO audit. The robots.txt standard gives webmasters the ability to control which robots (user-agents) may crawl their websites, either in part or entirely.

It’s a plain .txt file, served at the root of the domain at the /robots.txt path; e.g. salt.agency/robots.txt

A robots.txt file works by acting as a gatekeeper to a website for crawlers, such as Googlebot, and can be used to either define a set of rules for specific user-agents, or a blanket rule for all user agents. For example:

User-agent: *
Disallow: /

The purpose of this article is to walk through the basics of what a robots.txt file is, how to create valid commands (directives) within the .txt file, and then how to disallow the crawling of URIs based on text-string wildcards.

Robots.txt formatting

The robots.txt file, as of September 2019, has three directives (field names). These are:

  • User-agent
  • Disallow
  • Allow

The syntax for forming a valid robots.txt file is simple: after each command (or directive) you add a colon, followed by the attribute you wish the directive to apply to. For User-agent, this can be an exact user-agent, such as Googlebot or Bingbot, while the Disallow and Allow fields take URI paths, e.g. /wp-admin/

An important syntax formality to take note of is that while field names are not case-sensitive, attribute values are, so if you want to exclude /wp-admin/ you can’t put /WP-Admin/ in the .txt file.
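This case-sensitivity is easy to demonstrate with Python’s standard-library urllib.robotparser, which applies the same case-sensitive path matching (a quick sketch; the wrongly-cased rule is deliberate):

```python
from urllib.robotparser import RobotFileParser

# A deliberately wrongly-cased rule, to show that path matching
# in robots.txt is case-sensitive
rules = """\
User-agent: *
Disallow: /WP-Admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The wrongly-cased rule does NOT block the real lowercase path
print(rp.can_fetch("Googlebot", "/wp-admin/"))  # True
# It only blocks the exact casing it declares
print(rp.can_fetch("Googlebot", "/WP-Admin/"))  # False
```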

User-agent

User-agent is the field that declares which crawler the rules that follow apply to. User-agents are also known as bots, spiders, or crawlers. The syntax it follows is:

User-agent: (attribute)

The attribute can either be a * wildcard, which applies the rules to all user-agents (as in the example at the start of the post), or the name of a specific user-agent.

Each user-agent has its own identifier, but common ones include:

  • User-agent: Googlebot (Google’s search user agent)
  • User-agent: AhrefsBot (Ahrefs’ crawler)
  • User-agent: Bingbot (Bing’s search user agent)
  • User-agent: DuckDuckBot (DuckDuckGo’s search user agent)
  • User-agent: YandexBot (Yandex’s search user agent)

Google uses multiple user-agents for various services and has published a full list.

Disallow & allow

Disallow is the directive used to prevent User-agents from crawling specific parts of the site. Allow, by contrast, can be used to allow User-agents to crawl specific URIs even if you’re blocking the folder path through Disallow.

Disallow

Disallow’s syntax follows the same naming pattern as User-agent: (attribute). However, a common mistake made in robots.txt files (especially on staging websites) is that no attribute is specified:

User-agent: *
Disallow:

Without a value, this won’t exclude anything, so for a site-wide block you would put:

Disallow: /

Alternatively, if you wanted to block crawling of specific parts of the website, you would put something along the lines of:

User-agent: *
Disallow: /wp-admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /user-account/
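Prefix rules like these can be sanity-checked programmatically. As a rough sketch, Python’s standard-library urllib.robotparser applies the same simple prefix matching (note it implements the original standard only, with no support for Google-style * or $ wildcards):

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, as a list of lines
rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /user-account/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Any path under a disallowed folder is blocked for all user-agents
print(rp.can_fetch("Googlebot", "/wp-admin/options.php"))  # False
print(rp.can_fetch("Googlebot", "/checkout/step-1"))       # False

# Everything else remains crawlable
print(rp.can_fetch("Googlebot", "/blog/some-post"))        # True
```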

Comments in robots.txt

Robots.txt files are living documents, often maintained by multiple webmasters (especially on enterprise platforms with multiple teams working on different sections), so being able to add comments to them can be very useful. To add a comment line, start it with a # symbol.

A good example of this in practice is Google’s own robots.txt file that contains the entry:

# Certain social media sites are whitelisted to allow crawlers to access page markup when links to google.com/imgres* are shared. To learn more, please contact [email protected]
User-agent: Twitterbot
Allow: /imgres

This is also a good example of the Allow directive in the wild: elsewhere in the same robots.txt file, a directive instructs all user-agents not to crawl the /imgres path, but a special exception has been made for Twitterbot.
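This Allow-overrides-Disallow behaviour can be sketched with Python’s urllib.robotparser, which supports Allow lines; the rules below are a simplified stand-in for Google’s actual file, not a copy of it:

```python
from urllib.robotparser import RobotFileParser

# Simplified stand-in for the relevant part of google.com/robots.txt
rules = """\
User-agent: Twitterbot
Allow: /imgres

User-agent: *
Disallow: /imgres
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Twitterbot gets the explicit Allow exception
print(rp.can_fetch("Twitterbot", "/imgres"))    # True
# Every other bot falls back to the blanket Disallow
print(rp.can_fetch("SomeOtherBot", "/imgres"))  # False
```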

Advanced pattern matching

As well as being able to block whole folders and exact URI paths, the robots.txt file can also be used for more complex pattern matching through the * wildcard symbol.

The wildcard can be used in a number of ways:

# Prevent all bots crawling the images subfolder
User-agent: *
Disallow: /images/

And

# Prevent all bots from crawling blog category pagination
User-agent: *
Disallow: /blog/*/page/

This means that URI paths such as /blog/category-name/page/3 will be blocked from crawling, without having to specify each category and each pagination.
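Google-style wildcard matching can be approximated by translating the pattern into a regular expression: each * becomes .* and the pattern is anchored to the start of the path. This is a rough sketch of the documented semantics, not Google’s implementation:

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    """Approximate Google-style robots.txt pattern matching:
    * matches any run of characters, $ anchors the end of the URI,
    and the pattern must match from the start of the path."""
    regex = "^"
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.match(regex, path) is not None

# The pagination rule above blocks any category's /page/ URIs
print(robots_match("/blog/*/page/", "/blog/category-name/page/3"))  # True
# Category pages without /page/ in the path are unaffected
print(robots_match("/blog/*/page/", "/blog/category-name/"))        # False
```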

Wildcard examples

Blocking file types

User-agent: *
Disallow: *.js$
Disallow: *.css$
Disallow: *.json$

When blocking file types, it’s important not to block your CSS or JS files (as the example above would), as this will cause issues for Google when rendering your pages.

Blocking all parameter URLs

User-agent: *
Disallow: /*?

This can be useful, especially if your eCommerce website has a lot of faceted navigation and implementing nofollow on the filters isn’t immediately possible (although I’d do this on most Salesforce CC sites as standard to curb any potential index bloat).

Blocking specific parameters

User-agent: *
Disallow: /*prefn*
Disallow: /*prefv*
Disallow: /*pmin*
Disallow: /*pmax*

By the same token, you might not want to block all parameters, so if you know the parameter IDs you can block them individually by wrapping them in * wildcards.

Blocking search results pages

User-agent: *
Disallow: /search?q=*

This can be useful depending on your platform/internal search function, but ill-advised if you’re using the internal search function to create landing pages.

Crawl-delay Functionality

Crawl-delay is still featured in a number of robots.txt files and is included by default on platforms such as Shopify. However, as it’s not an official directive, different search engines (and user-agents) handle it differently.

User-agent: *
Crawl-delay: 10

Google ignores it, as does Baidu. Bing treats it as a time window in which it will crawl the site at most once (i.e. once every X seconds), and Yandex respects it.
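If you want to check what a parser actually reads from the directive, Python’s urllib.robotparser exposes it via crawl_delay(); whether a given crawler honours the value is, as above, another matter. A quick sketch:

```python
from urllib.robotparser import RobotFileParser

# The crawl-delay example from above
rules = """\
User-agent: *
Crawl-delay: 10
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Returns the declared delay in seconds, or None if not set
print(rp.crawl_delay("*"))        # 10
# Named agents fall back to the * group when they have no group of their own
print(rp.crawl_delay("Bingbot"))  # 10
```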

If crawl-delay is necessary, due to excessive Bing and Yandex bot activity, you need to consider how a long crawl-delay could affect the website, as it can limit how much of your website is crawled (and we know websites aren’t crawled in their entirety in a single bound).

By proxy, this can have an adverse effect on new content discovery and organic search performance.

Validations & Misconceptions

There are a number of misconceptions around the robots.txt file, and from experience, these tend to be around what is and isn’t “valid” for the job at hand.

Example 1

Disallow: /widgets*

This is treated as the equivalent of /widgets, as the trailing wildcard is ignored.

So, this matches and would disallow:

  • /widgets
  • /widgets.html
  • /widgets/blue/
  • /widgets/blue.html
  • /widgets-and-thingamabobs
  • /widgets-and-thingamabobs/green
  • /widgets.php
  • /widgets.php?filter=big-ones

And by contrast, does not match:

  • /Widgets
  • /Widgets.php
  • /what-is-a-widget
  • /products?filter=widgets

So, already there could be potential problems: did you mean to disallow the /widgets-and-thingamabobs subfolder as well? To disallow just the widgets subfolder, and subsequent folders, you should use:

Disallow: /widgets/

Or if you want to block that specific subfolder, but allow URIs such as /widgets/purple/ to be crawled, use the $ modifier:

Disallow: /widgets/$

As the $ anchors the match to the end of the URI.
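The behaviour of the $ anchor can be sketched by translating the pattern into a regular expression, where * becomes .* and a trailing $ becomes an end-of-string anchor. This is a rough approximation of the documented matching rules, not Google’s implementation:

```python
import re

def path_matches(pattern: str, path: str) -> bool:
    """Rough sketch of Google-style matching: * matches any run of
    characters, a trailing $ anchors the pattern to the end of the URI."""
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # un-escape a trailing $ into an anchor
    return re.match(regex, path) is not None

# Disallow: /widgets/$ blocks only the bare subfolder URI...
print(path_matches("/widgets/$", "/widgets/"))         # True
# ...while deeper URIs such as /widgets/purple/ stay crawlable
print(path_matches("/widgets/$", "/widgets/purple/"))  # False
```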

Example 2

Disallow: *.jpg

This will prevent the crawling of:

  • /image.jpg
  • /wp-content/uploads/image.jpg

But will not match:

  • /image.JPG

Of course, you can use Google’s robots.txt Tester to validate your implementations, but as example 1 shows, you need to consider the website as a whole and whether you may inadvertently block URIs beyond those intended.