For a website to be indexed in search engine results pages, search engine web crawlers (often called “spiders” or “spiderbots”) must first explore its pages.
These crawlers provide essential information to search engines so that the engines can supply users with the most useful and accurate results.
For crawlers to investigate a website efficiently, however, the site must be appropriately structured for navigation. This is where crawl depth comes into play.
What does crawl depth mean?
Put simply, crawl depth refers to the number of clicks, or pathways, that a page is away from the homepage of a website.
The homepage, therefore, has a crawl depth of zero, and any page a crawler reaches by following a single link from the homepage has a crawl depth of one.
How close your page is to the homepage will depend on what kind of page it is, and how important it is to the website.
Websites with thousands of pages will, of course, have different crawl depths to a website with just a couple of hundred pages.
That said, any strategically important page should not have a crawl depth of five or more, as it would signal to the crawler that it is a page of less importance.
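As a rough illustration of how crawl depth can be measured, the sketch below walks a small, made-up internal link graph breadth-first from the homepage and records the fewest clicks needed to reach each page. The site structure, the `crawl_depths` function name and the depth threshold of five (taken from the guideline above) are assumptions for the example, not output from any real crawler.

```python
from collections import deque

def crawl_depths(links, homepage):
    """Return {page: depth}: the fewest clicks needed to reach each page from the homepage."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # The first time a page is seen is via its shortest path.
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical site: each key links to the pages in its list.
site_links = {
    "/": ["/shop/", "/blog/"],
    "/shop/": ["/shop/shirts/"],
    "/shop/shirts/": ["/shop/shirts/white-tee/"],
    "/blog/": ["/blog/page/2/"],
    "/blog/page/2/": ["/blog/page/3/"],
    "/blog/page/3/": ["/blog/page/4/"],
    "/blog/page/4/": ["/blog/old-post/"],
}

depths = crawl_depths(site_links, "/")

# Flag pages that sit five or more clicks from the homepage.
too_deep = sorted(page for page, depth in depths.items() if depth >= 5)
print(too_deep)
```

In this made-up structure, only the old blog post reachable through four levels of pagination ends up at a depth of five, which is the kind of page worth linking to from somewhere closer to the homepage if it matters.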
It’s also worth noting that a crawler will only investigate a certain number of layers, as at some point it will decide that it is no longer necessary to crawl any deeper.
When a site publishes new pages, whether as commercial or supporting content, it is essential to get them crawled as soon as possible.
How to avoid crawl depth issues
There are several strategies to implement and habits to avoid so that your site has a workable structure for both crawlers and users.
Ensure that you have at least one XML sitemap
XML sitemaps show Google which URLs exist on a website, and they get crawled more than any other kind of sitemap (such as a video or image sitemap).
There can be many elements included in an XML sitemap, such as when a particular URL was last updated.
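To make this more concrete, here is a minimal sketch of generating a sitemap with a last-updated date for each URL, using Python's standard library. The URLs and dates are placeholders and `build_sitemap` is a hypothetical helper; a real site would pull this list from its CMS or database.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs; dates are YYYY-MM-DD strings."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Placeholder URLs and dates, for illustration only.
sitemap_xml = build_sitemap([
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/shop/shirts/", "2024-02-01"),
])
print(sitemap_xml)
```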
You can learn more about sitemaps and how to build them in this Google Search Console Help guide.
Inspect your pagination
Websites with a lot of content often use pagination so that they can quickly and easily provide content to users.
For instance, if a user visits a clothes site and searches for “medium white t-shirt”, through pagination, they will be provided with items within their specifications.
Pagination can cause crawling issues, however, as it creates deep pathways when there are very few items on each page or the list of items is long.
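A quick back-of-the-envelope calculation shows how fast those pathways deepen. The sketch below assumes the simplest case, where each paginated page links only to the next one and the first page of the listing is one click from the homepage; `deepest_page_depth` is a hypothetical helper and the item counts are invented.

```python
import math

def deepest_page_depth(total_items, items_per_page, listing_depth=1):
    """Clicks from the homepage to the last paginated page, assuming only "next" links."""
    page_count = max(1, math.ceil(total_items / items_per_page))
    # Page 1 sits at listing_depth; each further page costs one more click.
    return listing_depth + page_count - 1

few_per_page = deepest_page_depth(200, 10)   # 200 items split into 20 pages
more_per_page = deepest_page_depth(200, 50)  # 200 items split into 4 pages
print(few_per_page, more_per_page)
```

Under these assumptions, showing ten items per page pushes the last page 20 clicks from the homepage, while showing fifty items per page keeps it within four.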
You can avoid pagination-related crawling issues by cutting down lists, offering more items per page, or instructing crawlers to ignore low-quality pages.
To do the latter, you will need to access and modify your robots.txt file. Again, you can read about how to modify your robots.txt file in this Google Search Reference guide.
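Before deploying a robots.txt change, it can help to check which URLs a rule would actually block. The sketch below uses Python's built-in `urllib.robotparser` against a made-up rule set; the paths are hypothetical, and because this parser matches on simple path prefixes, it may not behave identically to every search engine's own implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all crawlers out of a filtered listing path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /shirts/filter/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_crawlable(url, user_agent="*"):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

blocked = is_crawlable("https://www.example.com/shirts/filter/white/page/9/")
allowed = is_crawlable("https://www.example.com/shirts/")
print(blocked, allowed)
```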
Limit dynamic URL crawling
A dynamic URL is designed to narrow down items within a site’s listing page, which will filter what information is displayed to users.
Typically used by ecommerce sites, dynamic URLs append parameters, which generate similar URLs.
This tactic can cause serious crawling issues when important pages are duplicated.
Although adding a canonical tag can stop the indexing of a page with a dynamic URL, it will not stop it from being crawled, so be sure to mark the links with a nofollow attribute.
Alternatively, you can block them through the parameter tool in Google Search Console and/or through Bing Webmaster Tools.
On the off chance that your site needs URL parameters to serve content, only implement the above if you are confident it will not negatively affect your website.
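To get a sense of how many near-duplicate URLs your parameters generate, you could group a list of crawled URLs by what remains once the filter parameters are stripped. The sketch below is one way to do that; the parameter names in `FILTER_PARAMS` and the example URLs are assumptions, so swap in whatever your own site uses.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameters that only narrow or re-order a listing page.
FILTER_PARAMS = {"colour", "size", "sort"}

def strip_filter_params(url):
    """Return the URL without the assumed filter parameters."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in FILTER_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def group_duplicates(urls):
    """Group URLs that collapse to the same address once filters are removed."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_filter_params(url)].append(url)
    return dict(groups)

# Hypothetical URLs as they might appear in a crawl export.
crawled = [
    "https://www.example.com/shirts/?colour=white&size=m",
    "https://www.example.com/shirts/?size=m&colour=white",
    "https://www.example.com/shirts/?sort=price",
    "https://www.example.com/shirts/?page=2",
]
groups = group_duplicates(crawled)
for base, variants in groups.items():
    print(base, len(variants))
```

Here the first three URLs collapse to the same listing page, while the `page` parameter is left alone because it serves different content.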
You can read more about dynamic URLs and faceted navigation in this blog.
Check for excessive 301 redirects
When sites get migrated, it is sometimes the case that a batch of URLs will get linked to without a trailing slash. This can be an issue if the rest of the site uses trailing slashes.
If a user or crawler goes to such a URL, they will get 301 redirected.
An example of a URL with a trailing slash:
An example of a URL without a trailing slash:
Although a small number of URLs with or without a trailing slash isn’t necessarily a huge issue, if this is the case for URLs numbering in the thousands, the problem compounds as Googlebot and other crawlers have to crawl more and more unnecessary URLs.
Always update links within your site when URLs are changed so that you can limit the number of 301 redirects.
If you already have this issue, look into creating a rewrite rule to add or remove the slashes.
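Rewrite rules themselves live in your web server configuration, but the underlying logic is simple enough to sketch. The example below assumes a site that uses trailing slashes everywhere except on file-like URLs (anything with a dot in the last path segment); `with_trailing_slash` is a hypothetical helper that could also be used to tidy up internal links in bulk.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def with_trailing_slash(url):
    """Add a trailing slash to the path unless the last segment looks like a file."""
    parts = urlsplit(url)
    path = parts.path or "/"
    last_segment = posixpath.basename(path)
    # Leave file-like URLs such as /sitemap.xml untouched.
    if not path.endswith("/") and "." not in last_segment:
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

# Hypothetical internal links pulled from a crawl.
fixed = with_trailing_slash("https://www.example.com/shop/shirts")
unchanged = with_trailing_slash("https://www.example.com/sitemap.xml")
print(fixed)
print(unchanged)
```

If your site has settled on URLs without trailing slashes instead, the same idea applies in reverse.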