23 Jul

How To Compile Search Engine Crawl Statistics From Apache Server Logs For Your SEO Campaign

Category: SEO

Web server access log files contain useful information for your SEO campaigns, such as crawl statistics for search engine robots like Googlebot and Baiduspider. For example, you can determine how often Googlebot comes to crawl your web pages. This matters because if Googlebot crawls your pages too infrequently, you should find out why, as it could affect your website’s ranking. Crawl statistics are just as important for China SEO campaigns.

Google Webmaster Tools does provide crawl statistics that tell you how many times Googlebot crawls your website each day. However, not all search engines provide such statistics. Baidu, the dominant search engine in China, does not: if you run China SEO campaigns that target Baidu, you cannot readily find crawl statistics in Baidu’s webmaster tools. You can, however, always work out the crawl statistics of Baiduspider (or any other crawler) yourself from the Apache server access log files.

This article assumes that you use Apache on Linux to power your website and that you know at least basic Linux commands. If you are running Apache on Windows, you can still upload the Apache server log files to a Linux server (or a Linux virtual machine) and use the commands below.

First, check whether Apache is logging which search engines access your website

Depending on the settings in the Apache configuration file, Apache may or may not record the identity of your website’s visitors in the server access log files. Visitors can be human beings or search engine robots. For each request, Apache writes a new line to the access log file. If Google visits your website, you should find the word ‘Googlebot’ in the line; if Baidu visits, you should see ‘Baiduspider’; and if Bing’s robot comes to crawl a page, you should find ‘bingbot’.

On Red Hat-style Linux distributions, the Apache configuration file is located by default at:

/etc/httpd/conf/httpd.conf

If you are not familiar with Linux commands, especially the vi or view editors, I suggest you have a Linux administrator modify the Apache configuration file for you, particularly on a production machine. When you are ready to go, log in as the root user, then:

# cd /etc/httpd/conf
# view httpd.conf

Look for the CustomLog parameter in the httpd.conf file. If the server runs several Apache Virtual Hosts, look for the CustomLog parameter inside each VirtualHost section instead.
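A quick way to locate every CustomLog directive, including those inside VirtualHost sections, is to search the configuration file directly:

# grep -n "CustomLog" /etc/httpd/conf/httpd.conf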

Make sure that the line with CustomLog is similar to the following:
CustomLog logs/host1_access_log combined

And NOT:
CustomLog logs/host1_access_log common

The word ‘combined’ means Apache will log the identity of everyone who visits your website, both humans and search engine robots; ‘combined’ is the setting we need. If it is already in place, you do not need to change the Apache configuration file. Otherwise, change ‘common’ to ‘combined’.
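For reference, both formats are normally defined in httpd.conf itself. The stock definitions look like this; ‘combined’ simply extends ‘common’ with the Referer and User-Agent fields, and it is the User-Agent field that carries the crawler’s name:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common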

Whenever you amend the Apache configuration file, you need to restart Apache. Ask a system administrator to do this for you if you are not familiar with the procedure.
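On a Red Hat-style system (matching the /etc/httpd paths used above), restarting Apache typically looks like this:

# service httpd restart

Alternatively, a graceful restart re-reads the configuration without dropping active connections:

# apachectl graceful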

Next, check the Apache server access log files

On Red Hat-style systems, Apache server access logs are located under the /var/log/httpd directory by default.

Now, change directory:

# cd /var/log/httpd

The server log file can be named something like host1_access_log. A typical line in the server log is:

66.249.77.14 - - [22/Jul/2013:05:51:00 +0800] "GET /sitemap.xml HTTP/1.1" 200 11993 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The above line tells you the date and time (22/Jul/2013:05:51:00, in the +0800 time zone) at which Googlebot requested /sitemap.xml, together with the HTTP status code (200) and the number of bytes returned (11993). The User-Agent string at the end of the line is what identifies the visitor as Googlebot.
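Since the User-Agent is the sixth quote-delimited field of the combined format, you can also get a quick tally of which clients, human browsers and crawlers alike, request your pages most often. A minimal sketch, using the example log file name from above:

# awk -F'"' '{print $6}' host1_access_log | sort | uniq -c | sort -rn | head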

Hence, to find out how many times Googlebot crawls your website, you can simply count the matching lines in the log files. On many Linux systems, the logrotate utility archives Apache log files weekly and appends a timestamp to each archived file name. For example, host1_access_log-20130630 was archived on June 30, 2013.
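For example, listing the log directory might show the current log alongside the weekly archives (the file names here are illustrative):

# ls -1 host1_access_log*
host1_access_log
host1_access_log-20130630
host1_access_log-20130707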

To count the number of times Googlebot crawled your website on, say, June 30, 2013, enter the following command:

# grep "30/Jun/2013" host1_access_log* | grep "Googlebot" | wc -l
156

The above command essentially counts the number of lines in the Apache access logs that contain both “30/Jun/2013” and “Googlebot”. It shows that Googlebot crawled the website 156 times on June 30, 2013.
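To tally several crawlers at once for the same day, you can loop over the bot names (this list of names is just an example; add or remove crawlers as needed):

# for bot in Googlebot Baiduspider bingbot; do echo -n "$bot: "; grep "30/Jun/2013" host1_access_log* | grep -c "$bot"; done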

Provided that the log files covering the period you are interested in have not yet been removed, you can run commands like the one above to derive the number of crawls by any search engine on any given day. You can also create a shell script to automate the process for each past day you want statistics for, as sketched below.
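Here is a minimal sketch of such a script; the script name, log path, and arguments are illustrative, so adjust them to your own setup:

#!/bin/bash
# crawlstats.sh - count how many times a crawler hit the site on given days
# Usage: ./crawlstats.sh Googlebot "30/Jun/2013" "01/Jul/2013" ...
BOT="$1"
shift
for DAY in "$@"; do
    # Count log lines containing both the date and the crawler name
    COUNT=$(grep "$DAY" /var/log/httpd/host1_access_log* | grep -c "$BOT")
    echo "$DAY $BOT: $COUNT"
done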

Note that, depending on your Linux settings, the date format in the logs may differ from the one shown above. Browse the contents of the Apache server access logs to confirm the date format before you run the command above; if you use the wrong format, grep will match nothing and every count will come back as 0.
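A quick way to check the date format is to print the timestamp field of the first log line; with the log format shown above, it is the fourth whitespace-separated field:

# awk '{print $4}' host1_access_log | head -1
[22/Jul/2013:05:51:00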