Weblog Madness!

In these days of Google Analytics it’s a bit passé to talk about boring old web server logs.  But there are still good reasons to go diving into the big text files generated by IIS or Apache.  In my poorly paid and humiliating day job I was recently asked to find out how popular our PDF publications were.  The trouble is that, being a traditional-style organisation, most staff members think of the internet as an email medium and send links to PDFs via email ‘blasts’.  Downloads this way can’t be picked up by the standard Google Analytics JavaScript-based tags.  We have a central library of publications and I made landing pages, but that’s a bit like closing the gate after the horse has bolted: what about last year’s traffic?

The only true answer is to go look at the actual logs of what files were served to whom and when.  The data’s all in there!  There are only two problems:

  1. Those are some big-ass files to filter
  2. Lots of downloads are by spiders, rather than people.

Problem #1 is pretty easy to fix.  Microsoft provides a command-line DOS tool as part of its IIS5 administrator’s toolkit (you can Google that) which will let you run SQL-like queries against W3C-format logfiles (and lots of other log formats).  Problem #2 is a bit more work.  Using the user-agent parameter of an HTTP request we can spot the spiders and filter them out, but there are a great many of them!  Building the WHERE clauses for the query is a major effort and you risk missing a bracket or comma somewhere.
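
If you haven’t met LogParser before, a hand-typed run looks something like this (the log folder path is just an example); it prints every successful request in a folder of W3C logfiles as CSV:

LogParser.exe -i:IISW3C "SELECT date, cs-uri-stem, c-ip FROM C:\inetpub\logs\*.log WHERE sc-status = 200" -o:csv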

The solution, as to all of life’s big problems, is automation or, more specifically, scripts.  PowerShell this time…

param([string]$inputFolder = "none", [string]$outputFile = "none", [string]$startDate = "none", [string]$endDate = "none", [string]$keyword = "none")

#QueryLogs is a wrapper for LogParser.exe, which allows SQL-like queries against logfiles
#It is set up to query IIS logfiles for PDF downloads, to filter out web spiders and to output in a useful format
#With parameters it can restrict to a date range and filter on a keyword within the PDF filename.

function output-help(){

"USAGE: .\queryLogs.ps1 -inputfolder xxx -outputfile xxx [-startdate xxx] [-enddate xxx] [-keyword xxx] "

}

switch("none"){

$outputFile {
"No output file specified!"
output-help
exit
}
$inputFolder {
"No input folder specified!"
output-help
exit
}

}

#Builds a LogParser clause that is true only when the user-agent does NOT contain $botname (INDEX_OF returns null when the substring is absent)
function buildRobotExcludeStatement([string]$botname){

"INDEX_OF(TO_LOWERCASE(cs(User-Agent)), TO_LOWERCASE('$botname')) = null"

}

$logparserLocation = "LogParser.exe"

$selectStatement = "SELECT date, cs-uri-stem, cs-uri-query, c-ip, cs(User-Agent)"
$fromStatement = "FROM $inputFolder\*.log TO $outputFile"

$whereStatement = "WHERE sc-status = 200 AND cs-method = 'GET' AND INDEX_OF(cs-uri-stem, '.pdf') > 0"

if($startDate -ne "none"){

$whereStatement = "$whereStatement AND date >= '$startDate'"

}

if($endDate -ne "none"){

$whereStatement = "$whereStatement AND date0"

}
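
#Optional keyword filter within the requested path (this is what the $keyword parameter is for), mirroring the '.pdf' check above
if($keyword -ne "none"){

$whereStatement = "$whereStatement AND INDEX_OF(TO_LOWERCASE(cs-uri-stem), TO_LOWERCASE('$keyword')) > 0"

}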

$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement "http:")
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("robot"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("xenu"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("vse"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("urlchecker"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("TimKimSearch"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("Jakarta+Commons"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("bot"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("spider"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("yandex"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("Xerka"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("www"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("crawler"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("HttpComponents"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("leecher"))

$parameters = "-i:IISW3C `"$selectStatement $fromStatement $whereStatement`" -o:csv"
$command = "$logparserLocation $parameters"
$command
"(running)"
iex $command

You can cut that code out and take it to the bank.  What it does is construct the command line for LogParser.exe and set it on its way.  You can give it a start and end date, the folder your *.log files are in, the name of an output file (it writes *.csv) and even a keyword to match in the request URI path.  Your final output is a CSV file of all the PDFs (that file extension is hardcoded for the moment) downloaded in, say, 2011.  In Excel you can perform some easy data analysis (pivot tables are a must) and find the true number of PDF downloads from your site.
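
If you save it as queryLogs.ps1 (the name the usage message assumes), a typical run looks something like this; the folder, filename, dates and keyword are only illustrative:

.\queryLogs.ps1 -inputFolder C:\inetpub\logs\W3SVC1 -outputFile pdfDownloads.csv -startDate 2011-01-01 -endDate 2011-12-31 -keyword annualreport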

If you run this, check the user-agents in the output and add any robots you find to the exclusion list.  The signatures included are just the ones we get.  It’s handy to compare the result with your Google Analytics traffic to see what’s not getting recorded.
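
A quick way to spot new offenders is to group the output by user-agent, something along these lines (assuming the output file was called pdfDownloads.csv and the column header comes through as cs(User-Agent), as per the SELECT above):

Import-Csv pdfDownloads.csv | Group-Object 'cs(User-Agent)' | Sort-Object Count -Descending | Select-Object Count, Name -First 20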

Of course, a smarter way to do this would be to build an ISAPI filter that listened for PDF requests and fired off an event to Google Analytics recording each download, but that seems like a lot of work given that my colleagues don’t have access to GA, wouldn’t know what to do with it if they did, and would probably prefer the data in Excel format anyway.
