If your website log file has always seemed like a bottomless silo of indecipherable data that just gets bigger and bigger, you could be missing out on crucial technical SEO value.

This guide will help you understand the value of log file analysis for SEO and how to go about it, so you can identify opportunities for your web marketing campaigns and especially for search engine marketing.

What is a log file?

Your website log file stores a list of all the requests made to your website’s hosting server by all traffic, including human visitors and search engine robots.

It’s anonymous data, but it does include certain identifying elements such as the originating IP address, the page or other content requested, the date and time, and a ‘user-agent’ field that can distinguish between search robots and human users.
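
To make that concrete, here is a minimal Python sketch that pulls those fields out of a single request line. It assumes the widely used ‘combined’ log format; the exact field order depends on your server configuration, and the values in the sample line are hypothetical.

```python
import re

# A sample request line in the common 'combined' log format (hypothetical values).
sample = ('66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /products/widgets HTTP/1.1" 200 5120 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

# Capture the IP, timestamp, request, status code, referrer and user-agent fields.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

match = pattern.match(sample)
if match:
    print(match.groupdict())
```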

The data stored in a log file is useful when troubleshooting, as it can show when an error occurred, but its value for technical SEO should not be overlooked either.

How does this help SEO?

When a search engine crawls your website, it allocates a ‘crawl budget’, essentially a fixed number of pages it will scan before it quits and goes elsewhere.

The internet is vast and with some websites hosting many thousands of pages, crawl budgets are a way for search engine robots to avoid getting stuck on one site for a very long period of time.

In addition, a crawl budget acts as a safeguard against websites that use URL parameters or dynamic URLs incorrectly, which might otherwise generate an infinite set of dynamic pages that the robot would never automatically leave.

Using your crawl budget wisely

You can’t control your crawl budget – the search engines set that themselves – but you can use it wisely by making sure your best content is crawled and indexed.

Log file analysis is a way to see which pages and other content (URLs and URIs) have been requested by Googlebot and other search robots recently, so you can tackle any technical issues that might be distracting them from your best content.

Eliminating ‘thin’ content and fixing dynamic pages and poorly handled URL parameters can all optimise the content that is indexed from your site, so your search visibility and rankings improve.

How to identify search robot requests

When looking at your log file, one of the key elements included is the ‘user-agent’ value. You can filter on this value to include only robot traffic, and not human user requests.

If you’re primarily interested in Google, look for the following user-agents:

  • Googlebot (both Desktop and Smartphone crawling)
  • Googlebot-Image (for Google Images indexing)
  • Googlebot-News (for Google News properties)
  • Googlebot-Video (for Google Video indexing)

If you need to distinguish between Desktop and Smartphone requests, look at the full user-agent string:

  • Desktop
    • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    • Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36
    • Googlebot/2.1 (+http://www.google.com/bot.html)
  • Smartphone
    • Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

In both of the above, “W.X.Y.Z” represents the version of Chrome used by the crawler and will update over time – so use wildcards in your filters rather than a specific version number.
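
As an illustration, a minimal filtering sketch along these lines might match on the ‘Googlebot/2.1’ token and use the presence of ‘Mobile Safari’ to separate Smartphone from Desktop requests, rather than pinning any Chrome version. The helper name below is our own, and the matching rules are an assumption to adapt to your own logs.

```python
def classify_googlebot(user_agent):
    """Return 'smartphone', 'desktop' or None for a given user-agent string."""
    if "Googlebot/2.1" not in user_agent:
        return None
    # The smartphone crawler identifies itself with 'Mobile Safari';
    # treat any other Googlebot/2.1 request as desktop.
    return "smartphone" if "Mobile Safari" in user_agent else "desktop"

# Example: the Chrome version is never checked, so it can change freely.
ua = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 "
      "Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
print(classify_googlebot(ua))  # -> smartphone
```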

Other search engines have their own user-agent values, so you can filter down to as many or as few as you think is relevant for your search marketing activities.

Check for spoof data

Unfortunately, it is possible to spoof the user-agent field — so a request might appear to come from Googlebot when it actually does not.

The solution to this is a reverse DNS lookup on the originating IP address to check that it resolves to a hostname belonging to the claimed crawler, followed by a forward DNS lookup to confirm that the hostname points back to the same IP.

Online tools exist to do this on a page-by-page basis, or you can use a script to automate reverse DNS checking; we have written a tool to verify Googlebot IPs to make this easy for smaller sites.
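
As a rough sketch of how such a script might work (this is not our tool, just an illustration using Python’s standard socket module), you can do the reverse lookup, check the hostname sits under Google’s crawler domains, and then forward-confirm it:

```python
import socket

def is_real_googlebot(ip_address):
    """Verify a claimed Googlebot request via reverse, then forward, DNS lookup."""
    try:
        # Step 1: reverse DNS - a genuine crawler resolves to a hostname
        # under googlebot.com or google.com.
        hostname, _, _ = socket.gethostbyaddr(ip_address)
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: forward DNS - the hostname must resolve back to the same IP,
        # otherwise the reverse record itself could be forged.
        return ip_address in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False

# Example with a hypothetical IP: expect True only for a genuine Googlebot address.
print(is_real_googlebot("66.249.66.1"))
```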

What log file data can tell you about search robots

There are a few different useful factors to identify from your filtered search robot log file data:

Crawl volume

How often is your website crawled by search engines, and which search engines crawl it more often than others?

This can be especially important for specific search engines like Yandex in Russia, Baidu in China, or Google Images and Google Video for multimedia-focused websites.
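
A simple tally is often enough here. The sketch below assumes you have already extracted the user-agent strings from your log file, and the crawler-to-name mapping is only our own shorthand to extend or adapt.

```python
from collections import Counter

# Substrings that identify the crawlers you care about; add or remove entries
# as relevant. More specific markers are listed first so they match first.
CRAWLERS = {
    "Googlebot-Image": "Google Images",
    "Googlebot-Video": "Google Video",
    "Googlebot": "Google Search",
    "bingbot": "Bing",
    "YandexBot": "Yandex",
    "Baiduspider": "Baidu",
}

def crawl_volume(user_agents):
    """Count requests per search engine, given an iterable of user-agent strings."""
    counts = Counter()
    for ua in user_agents:
        for marker, name in CRAWLERS.items():
            if marker in ua:
                counts[name] += 1
                break  # first (most specific) match wins
    return counts
```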

Crawl budget

As mentioned above, the search engines will crawl a finite number of pages on your website before giving up and going elsewhere.

Your server logs can show you what this number is, giving you a ‘crawl budget’ you can use to plan your site hierarchy and internal links, to direct the search robots to the right quantity of valuable pages.
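
One way to estimate this from your own data is to count the distinct URLs Googlebot fetches per day. The sketch below assumes each log entry has already been parsed into a dict with hypothetical ‘date’, ‘url’ and ‘user_agent’ keys.

```python
from collections import defaultdict

def urls_crawled_per_day(records):
    """Distinct URLs fetched by Googlebot per day, from parsed log records."""
    daily = defaultdict(set)
    for r in records:
        if "Googlebot" in r["user_agent"]:
            daily[r["date"]].add(r["url"])
    # The number of distinct URLs per day approximates your observed crawl budget.
    return {day: len(urls) for day, urls in sorted(daily.items())}
```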

Errors and redirects

Server logs include HTTP status codes. If something goes wrong, this can highlight missing or redirected content.

Page redirects are important for SEO when you move or delete old pages, but they can eat into crawl budget, so keep a close eye on them and use them wisely.
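
A quick summary of status codes for robot requests can surface both, for example as below. This again assumes parsed records with hypothetical ‘status’ and ‘user_agent’ keys, with the status code stored as a string.

```python
from collections import Counter

def status_summary(records):
    """Tally HTTP status codes for search robot requests only."""
    counts = Counter(
        r["status"] for r in records if "bot" in r["user_agent"].lower()
    )
    # Highlight how much of the robots' attention goes to redirects and errors.
    redirects = sum(n for code, n in counts.items() if code.startswith("3"))
    errors = sum(n for code, n in counts.items() if code.startswith(("4", "5")))
    return counts, redirects, errors
```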

Ways to use server logs for SEO

All this information might be interesting, but how is it useful for SEO campaigns?

Crawl to traffic delay

One beneficial way to use server logs for SEO is to look at the first crawl date for a page and compare this against your analytics data to see when organic traffic started to arrive.

If your website is quite consistent in this regard, you can start to factor this lag into your seasonal SEO campaigns, so that you publish timely content early enough for users to find it before an event such as Black Friday takes place.
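
A starting point is to record the earliest Googlebot request date per URL, which you can then join against your analytics export in a spreadsheet or analysis tool. A minimal sketch, assuming parsed records with ISO-formatted ‘date’ values so the dates sort correctly as strings:

```python
def first_crawl_dates(records):
    """Earliest Googlebot request date per URL, for comparison with the date
    organic traffic first appeared in your analytics platform."""
    first_seen = {}
    for r in records:
        if "Googlebot" in r["user_agent"]:
            url, date = r["url"], r["date"]
            if url not in first_seen or date < first_seen[url]:
                first_seen[url] = date
    return first_seen
```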

Meta robots and robots.txt

The ‘robots’ meta tag is a way to control access to a specific page for search robots. The robots.txt file sits in your root directory and can apply similar rules to entire sections of your site (or to your entire website in one fell swoop, so use it wisely!).

If your crawl budget is being wasted on thin content that has no SEO value at all, you might want to consider blocking those pages from the robots, so that they crawl your valuable content instead. You can keep the pages visible to human users if they hold some other kind of value or useful information.
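
Before and after making changes like this, it is worth confirming which URLs are actually blocked for Googlebot. Python’s standard urllib.robotparser can check this against your live robots.txt; the domain and paths below are placeholders for your own.

```python
from urllib import robotparser

# example.com and the paths below stand in for your own site.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the live robots.txt

for url in ("https://example.com/tag/widgets/", "https://example.com/products/widgets"):
    allowed = rp.can_fetch("Googlebot", url)
    print(f"{url} -> {'crawlable' if allowed else 'blocked'} for Googlebot")
```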

Common errors to look for

As well as making positive changes to your website, your server log data can flag up problems that need attention. In addition to those already mentioned, some other examples include:

Robots.txt errors

Mistakes in robots.txt can have wide-reaching consequences. A common mistake is to block search robot access during the development of a new website, but forget to allow access once the site goes live.

If you’re not seeing any data at all from a specific search robot in your server logs, it’s worth checking your robots.txt file and your page header meta tags, to make sure you haven’t accidentally blocked them across your entire server.
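
A crude but useful sanity check is to fetch the live robots.txt and look for a blanket ‘Disallow: /’ rule, the classic leftover from a development environment. The domain below is a placeholder, and this check deliberately ignores more subtle blocking patterns.

```python
from urllib import request

# Replace example.com with your own domain.
with request.urlopen("https://example.com/robots.txt") as resp:
    lines = resp.read().decode("utf-8", errors="replace").splitlines()

# Strip comments, then flag any rule that disallows the entire site.
blanket_blocks = [l for l in lines if l.split("#")[0].strip().lower() == "disallow: /"]
if blanket_blocks:
    print("Warning: robots.txt contains a blanket 'Disallow: /' rule")
```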

Missing pages

Those HTTP status codes mentioned above can be instrumental in following any updates to your website, especially if you delete or move pages as part of those updates.

Use your server log data to see where a page or entire folder has moved, and make sure you provide the correct temporary (302) or permanent (301) redirect code as appropriate.

You might sometimes want to filter out the ‘200’ status code completely, as this indicates the page loaded without a problem.

A 404 error means the requested page was not found at all. There are a few solutions to this: add a redirect to an equivalent existing page, or make sure your website serves a custom 404 error page with a search box and links to useful index pages.
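
To prioritise which redirects to add first, it helps to see which missing URLs the robots request most often. A minimal sketch, using the same hypothetical parsed-record shape as earlier:

```python
from collections import Counter

def top_not_found(records, limit=10):
    """The most frequently requested missing URLs (status 404) from search robots -
    prime candidates for a 301 redirect to an equivalent page."""
    misses = Counter(
        r["url"] for r in records
        if r["status"] == "404" and "bot" in r["user_agent"].lower()
    )
    return misses.most_common(limit)
```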

Combining data from multiple platforms

To dive even deeper into server log analysis, once the data is stripped and sanitised you can export it into an appropriate data analysis tool such as Google Data Studio, where each parameter gets its own column and you can apply formatting, formulae and analysis.

You can incorporate reports from other platforms too, such as website analytics and SEO tools.
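
As a small illustration of the first step, the sketch below writes parsed log records out to CSV, one column per parameter, ready to upload into Data Studio or a spreadsheet. The field names are the same hypothetical ones used in the earlier sketches.

```python
import csv

def export_for_analysis(records, path="crawl-log.csv"):
    """Write parsed log records to CSV, one column per parameter."""
    fieldnames = ["date", "ip", "url", "status", "user_agent"]
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```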

Knowing how to process all this data is relatively advanced, but it can be a powerful way to gain an in-depth technical overview of your server, website and SEO campaigns. If you are interested in an article on this, please let us know and we will write one for you.

Conduct regular audits

SEO is not a one-time-only task. Maintaining a good search presence means publishing new content over time, building your website’s number of crawled and indexed pages, and responding to competitors publishing optimised content of their own.

As your website grows, and especially if you make changes to existing content, the likelihood of crawl errors increases too.

Server log analysis for SEO is a way to spot those errors and to optimise the way search robots crawl new content on your site, so valuable pages are not missed during the next round of indexing.

For all these reasons, make it a regular admin task to check your server log data and take any necessary actions to protect your SEO value across your website, and consult a technical SEO expert if you’re in any doubt about how to proceed.