Analysing server logs is an important part of a technical SEO audit or troubleshooting session. These logs generally record every request made to the web server, including information such as:
- Host name or IP address of the client
- Date and time of access
- URL requested
- User Agent
- Method used
- HTTP status code
Below is an example of a line within a server log:
22.214.171.124 - - [13/Jul/2015:07:18:58 -0400] "GET /robots.txt HTTP/1.1" 200 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
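To make the fields concrete, here is a minimal Python sketch that splits such a line into named parts. It assumes the log uses the Apache combined log format, which the example line appears to follow, and that the line is written with plain ASCII quotes and hyphens, as in a raw log file. The names `LOG_PATTERN` and `parse_log_line` are our own, chosen for illustration.

```python
import re

# Assumed layout (Apache combined log format):
# host ident user [time] "method url protocol" status size "referer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = (
    '22.214.171.124 - - [13/Jul/2015:07:18:58 -0400] '
    '"GET /robots.txt HTTP/1.1" 200 0 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

fields = parse_log_line(sample)
print(fields["host"], fields["method"], fields["url"], fields["status"])
# prints: 22.214.171.124 GET /robots.txt 200
```

Lines in a different format (for example, a custom log format with extra fields) will not match this pattern and will return `None`, so the pattern would need adjusting to your server's configuration.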
This data makes log files extremely useful: they show exactly what search engine crawlers do when they visit your website.
Generally, it is easy to identify which requests come from Googlebot or other search engine crawlers by looking at the user agent. Unfortunately, user agents can be faked, which distorts the data. Luckily, each log line also records the host name or IP address of the client that made the request, and we can use that to verify whether a crawler is genuine.
Using that information, we can run a reverse DNS lookup on the accessing IP address, followed by a forward DNS lookup on the resulting host name, and check that it resolves back to the same IP address.
According to Google, the majority of requests from Googlebot come from googlebot.com or google.com. In our own experience, googlebot.com is the domain most often associated with these crawlers, rather than google.com.
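The check described above (reverse lookup, domain check, then forward lookup) can be sketched in Python as follows. This is only an illustration under a few assumptions, not the script we share below: the function name `is_verified_googlebot` and the injectable `reverse_lookup` and `forward_lookup` parameters are our own, and `socket.gethostbyname_ex` only resolves IPv4 addresses, so IPv6 clients would need a different forward lookup.

```python
import socket

# Domains Google names for its crawlers; the leading dot stops
# look-alike hosts such as "notgooglebot.com" from passing.
GOOGLE_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-resolve it back."""
    try:
        # Step 1: reverse DNS lookup, IP address -> host name.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False

    # Step 2: the host name must sit under one of Google's domains.
    if not hostname.lower().rstrip(".").endswith(GOOGLE_DOMAINS):
        return False

    try:
        # Step 3: forward DNS lookup, host name -> list of IPv4 addresses.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False

    # The crawler is genuine only if the round trip returns the same IP.
    return ip in addresses


# Usage (performs live DNS queries):
# is_verified_googlebot(fields["host"])
```

The two lookup functions are passed in as parameters so that results can be cached or replaced with stubs. On a large log file, caching the result per IP address avoids repeating the same DNS queries thousands of times.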
As server log files tend to be very large, it is extremely time consuming to verify all these bots manually. To speed up the process, we have written a simple Perl script that does this for us and converts the log to CSV at the same time, so we can run further analysis. The good news is that we would like to share it with you!
To use this script, simply run the following command. The output will be a CSV file that contains a list of verified Googlebot requests:
perl GoogleAccessLog2CSV.pl serverfile.log > verified_googlebot_log_file.csv
You can also capture the invalid log lines in a separate file by redirecting standard error, as in the following command:
perl GoogleAccessLog2CSV.pl < serverfile.log > verified_googlebot_log_file.csv 2> invalid_log_lines.txt
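For readers who prefer Python, the same split between a CSV on standard output and rejected lines on standard error can be sketched as below. This is not the Perl script itself. The function name `split_log` is hypothetical, and `parse` and `is_verified` stand for any callables that turn a line into a dict of fields and check a host. We also assume that "invalid" means a line that cannot be parsed, and that hits failing verification are simply skipped.

```python
import csv
import sys

COLUMNS = ["host", "time", "method", "url", "status", "user_agent"]


def split_log(lines, parse, is_verified, out=None, err=None):
    """Write verified hits to `out` as CSV and unparseable lines to `err`."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err

    writer = csv.writer(out)
    writer.writerow(COLUMNS)  # header row

    for line in lines:
        fields = parse(line)
        if fields is None:
            # Assumption: "invalid" means the line could not be parsed.
            err.write(line.rstrip("\n") + "\n")
            continue
        if is_verified(fields["host"]):
            writer.writerow([fields[name] for name in COLUMNS])
        # Parsed lines that fail verification are dropped silently.


# Usage, given a parser and a verifier of your own:
# split_log(sys.stdin, parse, is_verified)
```

Because the CSV goes to standard output and the rejected lines go to standard error, the same `>` and `2>` shell redirections shown above would work with a script built around this function.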
Below is the source code of the script, which we have published as a GitHub Gist for ease of access:
If you have any questions about how this script works, or need any further information, please do not hesitate to leave a comment or contact us.