How to Use the Robots Exclusion Protocol Effectively

By Chris Taylor, Managing Principal Consultant
August 24, 2020

The Robots Exclusion Protocol, or REP, is a way to tell search engine robots – including Google’s web crawler, known as Googlebot – that you don’t want them to crawl or index certain parts of your website.

It’s a useful way to protect sensitive data from showing up in the search results, and it’s commonly used during development of new websites to ensure the unfinished site does not accidentally appear in public searches.

REP can be defined using a text file placed in the root directory of your website, and because of this method, it is often referred to simply as ‘robots.txt’. There are several other places where you can give instructions to the search robots too.

How to Tell the Search Robots What to Do

You can give instructions to search robots using the following methods:

In your site’s root directory, using a robots.txt file.
In the HTTP header, using the X-Robots-Tag.
In the head of a page, using a robots meta tag.
In a hyperlink, using the rel attribute.

These methods all allow you to give the search robots certain instructions, for example:

Noarchive (Prevents search engines from offering a cached version of the page).
Nofollow (Stops the search robot following the hyperlinks on the page).
Noindex (Tells the search engine not to include the page in search results).
Nosnippet (Blocks the search engine from showing a snippet of the page text in its search results).

Arguably the most powerful of these is ‘noindex’ as it can prevent a page from appearing in the search results at all.

Which REP Method Should I Use?

If you want to give the search robots instructions for a specific page, you can do that by including a meta tag in the head of the page code:

<meta name="robots" content="noindex">

This will prevent an individual page from being indexed in search results, and is useful if you have only one page (or a relatively small number) that you want to hide.

You can also give instructions specifically to Google by changing ‘robots’ to ‘googlebot’ in the meta tag:

<meta name="googlebot" content="noindex">

The X-Robots-Tag HTTP header method also works on a per-page basis and can be used across a variety of filetypes.
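
As a rough sketch of how the header might be applied by file type, the Python below picks the extra response header for a given URL path. The helper name robots_headers and the list of extensions are assumptions made up for illustration; in practice you would more likely set the header in your web server configuration.

```python
from pathlib import PurePosixPath

# Hypothetical policy: file types that should stay out of search results.
NOINDEX_EXTENSIONS = {".pdf", ".docx"}

def robots_headers(url_path: str) -> dict[str, str]:
    """Return the extra HTTP response headers for the given URL path."""
    suffix = PurePosixPath(url_path).suffix.lower()
    if suffix in NOINDEX_EXTENSIONS:
        return {"X-Robots-Tag": "noindex"}
    return {}

print(robots_headers("/downloads/price-list.pdf"))  # {'X-Robots-Tag': 'noindex'}
print(robots_headers("/index.html"))                # {}
```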

If you want to apply a rule to your entire website, or to an entire directory – for example, while developing a new microsite that you don’t want to appear in the search results yet – then a robots.txt file could be the best option.
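
For illustration, here is a minimal sketch of a directory-wide rule, checked with Python’s standard-library urllib.robotparser. The /microsite/ path and the example.com URLs are placeholders, and this parser only does simple prefix matching, so treat it as a rough check and not as a model of how Googlebot behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: hide one directory from every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /microsite/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

microsite_ok = parser.can_fetch("*", "https://www.example.com/microsite/launch.html")
about_ok = parser.can_fetch("*", "https://www.example.com/about.html")
print(microsite_ok, about_ok)  # False True
```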

Basics of robots.txt

There are a few basics of robots.txt files that you should keep in mind if using this method:

UTF-8 Encoded ASCII

If you create or edit your robots.txt file in a plain text editor, make sure you save it in UTF-8 format. Keep the rules themselves to ASCII characters, and use percent-encoding for anything else, as you would in a URL (e.g. %20 for a space).
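
If you are unsure how a path should be encoded, Python’s urllib.parse.quote can do it for you. The helper name encode_rule_path is an assumption for this sketch, as is the choice to leave the asterisk and dollar sign unencoded so that robots.txt patterns stay readable.

```python
from urllib.parse import quote

def encode_rule_path(path: str) -> str:
    """Percent-encode a URL path; non-ASCII characters become UTF-8 byte escapes."""
    # Leave "/" and the robots.txt special characters "*" and "$" as they are.
    return quote(path, safe="/*$")

print(encode_rule_path("/my page.html"))  # /my%20page.html
print(encode_rule_path("/café/"))         # /caf%C3%A9/
```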

Case Sensitive

The file name of your robots.txt file and the URL paths in its rules are case-sensitive. The file must be named robots.txt in lower case, and if you want a rule to apply to both the lower-case and upper-case versions of a page URL, you’ll need two separate rules.
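
A quick way to see the effect is to test both spellings against a single rule. This sketch uses Python’s standard-library urllib.robotparser with a made-up /Private/ directory; other parsers may differ in detail, but path matching is case-sensitive here too.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule that only names the capitalised directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /Private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

upper_ok = parser.can_fetch("*", "https://www.example.com/Private/report.html")
lower_ok = parser.can_fetch("*", "https://www.example.com/private/report.html")
print(upper_ok, lower_ok)  # False True: the lower-case URL is not covered
```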

Longest Rule Wins

In general, the longest matching rule takes precedence. This is easy to miss when applying multiple rules to the same root directory: a longer rule can override a shorter one you expected to apply. If you need the shorter rule to win, you can lengthen it by adding an asterisk, which serves as a wildcard, to the end.
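
The idea can be sketched in a few lines of Python. This is a simplified model, not any search engine’s actual code: it handles plain path prefixes only (no wildcards), the function name is_allowed and the example paths are made up, and the tie-break in favour of ‘allow’ is an assumption based on Google’s published guidance.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply the longest matching rule; rules are (directive, path_prefix) pairs."""
    matches = [(len(prefix), directive == "allow")
               for directive, prefix in rules
               if prefix and path.startswith(prefix)]
    if not matches:
        return True  # no rule matches, so crawling is allowed
    # max() on (length, is_allow) tuples picks the longest rule;
    # at equal length, allow (True) sorts above disallow (False).
    return max(matches)[1]

RULES = [("disallow", "/shop/"), ("allow", "/shop/sale/")]
print(is_allowed("/shop/sale/shoes.html", RULES))  # True
print(is_allowed("/shop/basket.html", RULES))      # False
```

Note that Python’s own urllib.robotparser does not work this way; it applies the first matching rule in file order, which is one reason to test with the tools your target search engine provides.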

Multiple User-Agents

Just like the robots meta tag mentioned above, a robots.txt file can give instructions to a specific search robot or ‘user-agent’ such as ‘googlebot’. However, if you do this, you should be aware that a robot that finds a group of rules addressed to it will generally follow only that group and ignore the general rules under ‘User-agent: *’ – so you should restate your general rules for each specific user-agent.
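
Here is a small sketch of that pitfall using Python’s standard-library urllib.robotparser, which also picks the group for a named robot over the general one. The paths and the ‘otherbot’ name are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# The googlebot group does not repeat the general /private/ rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://www.example.com/private/notes.html"
print(parser.can_fetch("googlebot", url))  # True: only the googlebot group applies
print(parser.can_fetch("otherbot", url))   # False: falls back to the * group
```

Adding ‘Disallow: /private/’ to the googlebot group as well would close the gap.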

Testing

It’s easy to make a mistake that leads to unexpected results from your robots.txt file, with the potential to hide your entire website from the search engines. Any new robots.txt file and any new individual rules should be thoroughly tested as the stakes are high. You can test your robots.txt with Google’s robots.txt testing tool.

Common Errors in robots.txt

Separate Rules for Separate Protocols

If your site supports multiple protocols, you should provide separate robots.txt files for each. For example, you will need a robots.txt file for the http protocol and another one for https.

The same is true of subdomains, e.g. www.example.com, blog.example.com, shop.example.com and so on.
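
To make this concrete, the sketch below works out which robots.txt file governs a given page, using only the protocol and host name. The function name robots_txt_url is made up for the example.

```python
from urllib.parse import urlsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

for page in ("http://www.example.com/a.html",
             "https://www.example.com/a.html",
             "https://blog.example.com/post/1"):
    print(robots_txt_url(page))
```

The three pages map to three different robots.txt URLs, so each one needs its own file.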

Unterminated Rules

Just as you can use an asterisk as a wildcard, you can also use a dollar sign to mark the end of a URL and delimit the permitted matches. For example:

Disallow: /*.html
Disallow: /*.html$

The first version of this rule looks for any page with .html in its URL. The second version is limited only to URLs that end with .html followed by no further characters.

Remember the longest matching rule is applied by the search robots. Terminating rules with a dollar sign, combined with asterisks as wildcards, gives you more flexibility to force your preferred rule to take precedence.
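
One way to reason about the two rules above is to translate them into regular expressions. The Python below is a sketch of that translation, not a search engine’s real matcher; the function names rule_to_regex and rule_matches are assumptions for illustration.

```python
import re

def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    # Escape each literal piece, then rejoin with ".*" where the "*" wildcards were.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def rule_matches(rule_path: str, url_path: str) -> bool:
    # match() anchors at the start of the path, as robots.txt rules do.
    return rule_to_regex(rule_path).match(url_path) is not None

print(rule_matches("/*.html", "/page.html?ref=1"))   # True
print(rule_matches("/*.html$", "/page.html?ref=1"))  # False
print(rule_matches("/*.html$", "/page.html"))        # True
```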

Forgotten robots.txt Files

As mentioned earlier, it’s common to add a catch-all robots.txt file during website development and testing, so the site does not get crawled or indexed before it is ready for public viewing.

One of the worst-case scenarios is when this file gets left behind on launch day – giving you a live version of your website that is completely hidden from search engines.
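
A simple pre-launch check can catch this. The sketch below uses Python’s standard-library urllib.robotparser to ask whether a set of rules blocks the home page; the function name blocks_whole_site and the two sample rule sets are assumptions, and how you obtain the live file’s text is left to you.

```python
from urllib.robotparser import RobotFileParser

def blocks_whole_site(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the rules stop user_agent from fetching the home page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")

STAGING_RULES = "User-agent: *\nDisallow: /\n"      # catch-all used in development
LIVE_RULES = "User-agent: *\nDisallow: /admin/\n"   # what launch day should look like

print(blocks_whole_site(STAGING_RULES))  # True
print(blocks_whole_site(LIVE_RULES))     # False
```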

Benefits of the robots.txt REP

Adding a robots.txt file to your root folder is fast and effective, with the ability to block crawler access to your entire site during development or at any other time when you want to hide it quickly.

Once you know what you’re doing, it’s a very easy method to implement. While mistakes are possible, they usually fall into several simple areas such as case sensitivity, user-agent errors and longest rule matching, all of which an experienced developer can watch out for.

With careful testing using third-party tools like Google’s Search Console robots.txt report, you can verify that your rules are working as expected – and when you want to change anything, editing a single robots.txt file is much easier than editing meta tags on every individual page.