Generally speaking, search agencies are hired to help businesses become as visible as possible to search engines.
Despite this, there are a few very good reasons why a business might want to restrict certain pages from being discovered.
In this article we’ll take a look at a few reasons why this might be the case, as well as ways to stop selected pages from being indexed.
Resolving keyword cannibalisation
The issue of keyword cannibalisation is something that all websites should take seriously, as it has the potential to damage your site’s rankings.
This can happen when a site’s architecture causes the same keyword or phrase to be targeted in multiple places across the website. A common example is an eCommerce website where, from a user’s perspective, it would be logical to nest a brand under a category and, at the same time, a category under a brand.
Having the same phrase targeted on two URLs, as well as within duplicate titles, header tags, and anchor text, can confuse search engines to the point where they can no longer identify the page that you actually want to rank, which can lower both conversions and the perceived quality of your site’s content.
One of the many ways of solving the issue is to stop search engine spiders from indexing certain pages, to help ensure that only the most relevant and highest-converting pages rank for the correct terms. In this example, it would also be beneficial to use rel=canonical, in addition to noindex as a robots directive.
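As a rough illustration of how cannibalisation candidates might be spotted, the sketch below groups URLs that share the same page title. The URLs, titles, and the function name find_cannibalised_titles are hypothetical; in practice the input would come from a crawl of your own site.

```python
from collections import defaultdict


def find_cannibalised_titles(pages):
    """Group URLs by normalised title and return the titles used by more than one URL.

    `pages` maps URL -> title text (hypothetical data, e.g. from a site crawl).
    """
    urls_by_title = defaultdict(list)
    for url, title in pages.items():
        # Normalise case and whitespace so trivial differences do not hide duplicates.
        key = " ".join(title.lower().split())
        urls_by_title[key].append(url)
    return {title: sorted(urls) for title, urls in urls_by_title.items() if len(urls) > 1}


# Hypothetical example: a brand nested under a category, and vice versa.
pages = {
    "/trainers/acme/": "Acme Trainers",
    "/acme/trainers/": "Acme  trainers",
    "/trainers/": "Trainers",
}
duplicates = find_cannibalised_titles(pages)
```

Each group returned is a set of URLs competing for the same phrase, and so a candidate for a canonical or noindex decision.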
For more about keyword cannibalisation, and how to get rid of it, read our previous blog on the issue.
Culling extraneous pages
Sometimes a site will need to create one or multiple URLs that serve little or no purpose to search engines.
Such pages can be anything from “thank you” or “confirmation” pages, to randomly generated URLs.
Having them visible to search engines can waste crawl budget, and it creates a poor experience for users who land on these pages directly from a SERP.
What’s more, these pages are likely to be viewed as “thin” by Google, which can in time cause issues with Google algorithms such as Google Panda.
Ensuring company privacy
From private correspondence to alpha products, test pages, and invoices, there is a plethora of privacy and security reasons why you’d want to keep search engines away from certain pages (you would be amazed at the number of developers who allow personal data to be indexed in Google).
Avoiding duplicate content
Quite often a site might serve content from multiple places, such as:
- staging/testing servers
- international versions
- mobile sites
- printable pages
- PDF versions.
Referencing one default version of the content for a given language and/or region is a constant battle in enterprise SEO projects, which often deal with variables such as those listed above.
How to stop search engines from indexing your pages
Just as there is a range of reasons why you might want to stop search engines from indexing your pages, there is also a whole host of ways to do so.
Use ‘noindex’ to block search indexing
You can prevent a page from appearing in search by including a noindex meta tag in the <head> of the page, or by returning a ‘noindex’ header in the HTTP response. If you also want to block all crawler access to the URL(s), you can disallow them in the robots.txt file, although this on its own does not guarantee removal from the index.
Once the directive is in place, when Googlebot (other bots are available) next crawls that particular URL, it will respect the directive and, in time, drop the page from the Google Search index.
However, for this directive to be effective at removing previously indexed URLs, the page must not be blocked by robots.txt. If it is, the crawler will never see the noindex directive, and the page may continue to appear in the SERPs until Google decides to remove it from its index because it can no longer crawl the URL(s).
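To make the conflict concrete, here is a minimal sketch that checks whether a crawler is allowed to fetch a URL at all, and therefore whether it could ever read a noindex on that page. It uses Python’s standard urllib.robotparser, which may not match Google’s own parser in every detail; the robots.txt content, URLs, and the function name noindex_can_be_seen are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def noindex_can_be_seen(robots_txt, url, user_agent="Googlebot"):
    """Return True if `user_agent` may crawl `url`, i.e. it is able to read a noindex there."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks the /thank-you/ directory for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /thank-you/
"""

# Blocked from crawling, so a noindex on this page would never be read.
blocked_page = noindex_can_be_seen(robots_txt, "https://www.example.com/thank-you/order")
# Crawlable, so a noindex on this page can take effect.
open_page = noindex_can_be_seen(robots_txt, "https://www.example.com/products/")
```

If a page you want removed comes back as not crawlable, the usual fix is to lift the robots.txt block until the noindex has been picked up.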
To prevent the majority of crawlers from indexing a page on your site, place the following in the <head> of the page:
<meta name="robots" content="noindex">
Google uses multiple types of user agent. To prevent only Googlebot from indexing your page, you can use:
<meta name="googlebot" content="noindex">
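If you want to verify that a page actually carries one of these tags, a small checker along the following lines can help. This is only a sketch using Python’s standard html.parser; the class and function names are hypothetical, and it treats “none” as implying noindex.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the comma-separated directives of every <meta name="..." content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content") or ""
        if name:
            tokens = {token.strip().lower() for token in content.split(",") if token.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def is_noindexed(html, crawler="googlebot"):
    """Return True if the generic robots tag or the crawler-specific tag asks for noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    applicable = parser.directives.get("robots", set()) | parser.directives.get(crawler, set())
    return "noindex" in applicable or "none" in applicable
```

For example, is_noindexed('<head><meta name="robots" content="noindex"></head>') returns True, while a page with no robots meta tag returns False.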
HTTP response header
You can also return an X-Robots-Tag header with a value of noindex or “none” in your response. An example of this will look like the following:
HTTP/1.1 200 OK
X-Robots-Tag: noindex
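How the header gets added depends on your server or framework. As one possible sketch, the minimal WSGI application below attaches X-Robots-Tag: noindex to responses for selected paths. The path prefixes and the names app and response_headers are hypothetical, chosen only for illustration.

```python
# Hypothetical path prefixes that should never be indexed.
NOINDEX_PREFIXES = ("/thank-you", "/confirmation")


def app(environ, start_response):
    """Minimal WSGI app that adds X-Robots-Tag: noindex to selected paths."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"OK\n"]


def response_headers(path):
    """Call the app directly (no server needed) and return its response headers as a dict."""
    captured = {}

    def start_response(status, headers):
        captured["headers"] = dict(headers)

    app({"PATH_INFO": path}, start_response)
    return captured["headers"]
```

The same app could be served locally with wsgiref.simple_server.make_server from the standard library. The header approach is particularly handy for non-HTML files such as PDFs, which have no <head> to carry a meta tag.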
Password protect your server directories
You can also block URLs by password protecting your server directories, which is imperative if you have confidential or private content.
This can be the simplest and most effective way to keep private URLs out of search results.
Once protection is in place, all web crawlers will be prevented from accessing the content.
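Password protection is normally configured on the web server itself rather than written by hand. Purely to show why it keeps crawlers out, the sketch below performs the same kind of HTTP Basic authentication check in application code. The credentials and function names are hypothetical, and real credentials should never be hard-coded like this.

```python
import base64
import hmac

# Hypothetical credentials for illustration only.
EXPECTED_USER = "staging"
EXPECTED_PASSWORD = "change-me"


def is_authorised(authorization_header):
    """Return True if an HTTP Basic Authorization header carries the expected credentials."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False
    try:
        encoded = authorization_header[len("Basic "):]
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, separator, password = decoded.partition(":")
    if not separator:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    password_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    return user_ok and password_ok


def status_for(authorization_header):
    """A crawler sends no credentials, so it receives 401 and never sees the content."""
    return "200 OK" if is_authorised(authorization_header) else "401 Unauthorized"
```

Because a crawler arrives without credentials, it only ever receives the 401 response, so there is no content for it to index.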
Use robots.txt to manage crawling and reduce wastage
Robots.txt shouldn’t be used to stop your pages appearing in a search engine, as they can still appear in results if they have been linked to by other websites.
However, you can use robots.txt to control crawl traffic so that you do not waste crawl budget on unimportant pages.
The file itself should reside in the root of your website, at the path /robots.txt.
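As a quick sketch of what “the root” means in practice, the helper below works out which robots.txt file governs any given page URL: same protocol and host, with the path replaced by /robots.txt. The function name robots_txt_url and the example.com URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL that governs `page_url` (same scheme and host, root path)."""
    parts = urlsplit(page_url)
    # Drop the page's path, query string, and fragment; keep only scheme and host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


main_site = robots_txt_url("https://www.example.com/shop/trainers?page=2")
blog_site = robots_txt_url("https://blog.example.com/post")
```

Note that the two hypothetical hosts resolve to two different files, so a subdomain needs its own robots.txt.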
There are risks and limitations of using robots.txt, however:
- The file is only valid for the full domain it resides on, and it’s worth noting that different search engines may interpret directives differently.
- Instructions in robots.txt files cannot enforce crawler behaviour on a site, as they act only as directives that each crawler chooses whether to follow.
- While Googlebot and other reputable crawlers will obey instructions in a robots.txt file, others might not.
- A page disallowed in robots.txt can still be indexed if it is linked from other sites, which means that its URL and other information (such as anchor text in links to the page) can still appear in search results.
If you want to know a little more about what pages you might want to hide from search engines, or if you are concerned about what might be publicly available on your website, get in touch with a member of our team via the contact page to see if we can help.