Robots.txt guide for SEOs
The robots.txt file is commonplace in any standard technical SEO audit, and the robots.txt standard gives webmasters the ability to control which robots (user-agents) are able to crawl their websites, either in part or entirely.
It’s a plain .txt file, most commonly found at the /robots.txt path in the root of a host; e.g. salt.agency/robots.txt
A robots.txt file works by acting as a gatekeeper to a website for crawlers, such as Googlebot, and can be used to either define a set of rules for specific user-agents, or a blanket rule for all user agents. For example:
User-agent: *
Disallow: /
The purpose of this article is to walk through the basics of what a robots.txt file is, how to create valid commands (directives) within the .txt file, and then how to disallow the crawling of URIs based on text-string wildcards.
Robots.txt formatting
The robots.txt file, as of September 2019, has three directives (field names). These are:
- User-agent
- Disallow
- Allow
The syntax for forming a valid robots.txt file is simple: after each command (or directive) you add a colon, followed by the attribute you wish the directive to apply to. For User-agent, the attribute is a specific user-agent, such as Googlebot or Bingbot, while for the Disallow and Allow fields it is a URI path, e.g. /wp-admin/
An important syntax formality to take note of is that while field names are not case-sensitive, attribute values are, so if you want to exclude /wp-admin/ you can’t put /WP-Admin/ in the .txt file.
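Putting that together, a minimal and purely illustrative example combining all three field names might look like this:
# Hypothetical example: block the admin folder, but keep one file within it crawlable
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php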
User-agent
User-agent is the field that declares which crawler the rules that follow apply to. These are also known as bots, spiders, or crawlers. The syntax it follows is:
User-agent: (attribute)
The attribute can either be the * wildcard, which applies the rules to all user-agents (as in the example at the start of the post), or the name of a specific user-agent if you only want to target one crawler.
Each user-agent has its own identifier, but common ones include:
| Directive | Description |
| --- | --- |
| User-agent: Googlebot | Specify Google’s search user agent |
| User-agent: AhrefsBot | Specify Ahrefs’ crawler |
| User-agent: Bingbot | Specify Bing’s search user agent |
| User-agent: DuckDuckBot | Specify DuckDuckGo’s search user agent |
| User-agent: YandexBot | Specify Yandex’s search user agent |
Google uses multiple user-agents for various services and has published a full list.
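As an illustrative sketch (the blocked paths are hypothetical), you can give one crawler its own group of rules while every other user-agent falls back to the wildcard group:
# Rules for Google’s search crawler only (hypothetical path)
User-agent: Googlebot
Disallow: /internal-search/

# Rules for every other user-agent
User-agent: *
Disallow: /wp-admin/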
Disallow & allow
Disallow is the directive used to prevent User-agents from crawling specific parts of the site. Allow, by contrast, can be used to allow User-agents to crawl specific URIs even if you’re blocking the folder path through Disallow.
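As a quick sketch of how the two directives interact (the paths are hypothetical):
User-agent: *
# Block the whole folder from being crawled...
Disallow: /private/
# ...but allow one URI within it
Allow: /private/public-report.html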
Disallow
Disallow follows the same Directive: (attribute) syntax pattern as User-agent. However, a common mistake made in robots.txt files (especially on staging websites) is that no attribute is specified:
User-agent: *
Disallow:
Without a value, this won’t exclude anything, so for a site-wide block you would put:
Disallow: /
Or alternatively, if you wanted to block crawling of specific parts of the website, you would put something along the lines of:
User-agent: *
Disallow: /wp-admin/
Disallow: /checkout/
Disallow: /cart/
Disallow: /user-account/
Comments in robots.txt
Robots.txt files are living documents, often maintained by multiple webmasters (especially on enterprise platforms with multiple teams working on different sections), so being able to add comments to them can be very useful. To add a comment line, start it with a # symbol.
A good example of this in practice is Google’s own robots.txt file that contains the entry:
# Certain social media sites are whitelisted to allow crawlers to access page markup when links to google.com/imgres* are shared. To learn more, please contact [email protected].
User-agent: Twitterbot
Allow: /imgres
This is also a good example of the Allow directive in the wild: elsewhere in the same robots.txt file, all user-agents are instructed not to crawl the /imgres path, but a special exception has been made for Twitterbot.
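A simplified sketch of that pattern (not a verbatim copy of Google’s file) looks like this:
# All user-agents are blocked from /imgres...
User-agent: *
Disallow: /imgres

# ...but Twitterbot is given an explicit exception
User-agent: Twitterbot
Allow: /imgres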
Advanced pattern matching
As well as being able to block whole folders and exact URI paths, the robots.txt file can also be used for pattern matching to cover more complex needs through the * wildcard.
This can be used in a number of ways for user-agents:
# Prevent all bots crawling the images subfolder
User-agent: *
Disallow: /images/
And
# Prevent all bots from crawling blog category pagination
User-agent: *
Disallow: /blog/*/page/
This means that URI paths such as /blog/category-name/page/3 will be blocked from being crawled, without having to specify each category and each page of pagination.
Wildcard examples
Blocking file types
User-agent: *
Disallow: *.js$
Disallow: *.css$
Disallow: *.json$
The example above shows the syntax, but in practice it’s important not to block your CSS or JS files, as this will cause issues for Google when rendering your pages.
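If you do block a folder that contains those assets, one hedged workaround (a sketch using hypothetical WordPress-style paths) is to pair the Disallow with longer, more specific Allow rules, which Google treats as taking precedence:
User-agent: *
# Hypothetical: block a theme folder, but keep the CSS and JS inside it crawlable
Disallow: /wp-content/themes/
Allow: /wp-content/themes/*.css$
Allow: /wp-content/themes/*.js$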
Blocking all parameter URLs
User-agent: *
Disallow: /*?
This can be useful, especially if your eCommerce website has a lot of faceted navigation and implementing nofollow on the filters isn’t immediately possible (although I’d do this on most Salesforce CC sites as standard to curb any potential index bloat).
Blocking specific parameters
User-agent: *
Disallow: /*prefn*
Disallow: /*prefv*
Disallow: /*pmin*
Disallow: /*pmax*
By the same token, you might not want to block all parameters, so if you know the parameter names you can block just those by wrapping each one in * wildcards.
Blocking search results pages
User-agent: *
Disallow: /search?q=*
This can be useful depending on your platform and internal search function, but it’s ill-advised if you’re using the internal search function to create landing pages.
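The exact path depends on your platform; on a default WordPress install, for example, internal search runs through the s parameter, so the equivalent rule might look like this (an assumption worth checking against your own search URLs):
User-agent: *
Disallow: /*?s=*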
Crawl-delay Functionality
Crawl-delay is still featured in a number of robots.txt files and is included by default on platforms such as Shopify. However, as it’s not an official directive, different search engines (and user-agents) handle it differently.
User-agent: *
Crawl-delay: 10
Google ignores it, as does Baidu. Bing, however, treats it as a time window in which it can crawl the site once, i.e. once every X seconds, and Yandex respects it.
If crawl-delay is necessary, due to excessive Bing and Yandex bot activity, you need to consider how a long crawl-delay could affect the website, as it can limit how much of your website is crawled (and we know websites aren’t crawled in their entirety in a single bound).
By proxy, this can have an adverse effect on new content discovery and organic search performance.
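Since Google ignores the directive anyway, one hedged approach is to scope Crawl-delay to the crawlers that act on it rather than applying it through the wildcard (the 10-second value is purely illustrative):
# Scope the delay to the crawlers that act on it
User-agent: Bingbot
Crawl-delay: 10

User-agent: YandexBot
Crawl-delay: 10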
Validations & Misconceptions
There are a number of misconceptions around the robots.txt file, and from experience, these tend to be around what is and isn’t “valid” for the job at hand.
Example 1
Disallow: /widgets*
This is treated as the equivalent of /widgets, as the trailing wildcard is ignored.
So, this matches and would disallow:
- /widgets
- /widgets.html
- /widgets/blue/
- /widgets/blue.html
- /widgets-and-thingamabobs
- /widgets-and-thingamabobs/green
- /widgets.php
- /widgets.php?filter=big-ones
And by contrast, does not match:
- /Widgets
- /Widgets.php
- /what-is-a-widget
- /products?filter=widgets
So, already there could be potential problems: did you mean to disallow the /widgets-and-thingamabobs URIs as well? To disallow just the /widgets/ subfolder, and everything within it, you should use:
Disallow: /widgets/
Or if you want to block that specific subfolder, but allow URIs such as /widgets/purple/ to be crawled, use the $ modifier:
Disallow: /widgets/$
The $ specifies that the URI must end at that point for the rule to match.
Example 2
Disallow: *.jpg
This will prevent the crawling of:
- /image.jpg
- /wp-content/uploads/image.jpg
But will not match:
- /image.JPG
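Because attribute values are case-sensitive, covering the uppercase variant as well requires a second rule (the extensions here are just an example):
User-agent: *
Disallow: *.jpg
# A separate rule is needed for the uppercase extension
Disallow: *.JPG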
Of course, you can use Google’s robots.txt Tester to validate your implementations, but as example 1 shows, you need to consider the website as a whole and whether you may inadvertently block URIs beyond those intended.