More on Web Log Analysis

In my previous post on web log analysis, I described a PowerShell wrapper script for LogParser.exe, which lets you run SQL-style queries against text logfiles.  Today I have another script which wraps that one and runs as a scheduled task to send the filtered logs to the client each month.

[sourcecode language="powershell"]
#GenerateLogAnalysis will query IIS logfiles and output logs for PDF downloads from the first until
#the last day of the previous month

#function that performs the actual analysis
function RunLogAnalysis(){

$command = "c:\users\daniel.cooper\desktop\scripts\queryLogs.ps1 -inputFolder {0} -outputFile {1} -startDate {2} -endDate {3} -keyword {4}" -f $inputFolder, ($outputPath+$outputFile), (ConvertDateToW3C($startDate)), (ConvertDateToW3C($endDate)), "elibrary"
$command
invoke-expression $command

$emailBody = "<div style=""font-family:Trebuchet MS, Arial, sans-serif;""><img src=""http://www.undp.org/images/cms/global/undp_logo.gif"" border=""0"" align=""right""/><h3 style=""color:#003399;"">Log Analysis</h3>A log analysis has been run on the eLibrary for PDF files for "+$monthNames[$startDate.month-1]+" "+$startDate.Year+"<br/>Please find it attached."

sendEmail "recipient@example.org" "sender@example.org" "eLibrary Log Analysis: $outputFile" ($outputPath+$outputFile) $emailBody
}

function ConvertDateToW3C($dateToBeConverted){

return "{0:D4}-{1:D2}-{2:d2}" -f $dateToBeConverted.year, $dateToBeConverted.month, $dateToBeConverted.day;

}

function sendEmail($toAddress, $fromAddress, $subject, $attachmentPath, $body){

$SMTPServer = "yourMailServer"

$mailmessage = New-Object system.net.mail.mailmessage
$mailmessage.from = ($fromAddress)
$mailmessage.To.add($toAddress)
$mailmessage.Subject = $subject
$mailmessage.Body = $body

$attachment = New-Object System.Net.Mail.Attachment($attachmentPath, 'text/plain')
$mailmessage.Attachments.Add($attachment)

$mailmessage.IsBodyHTML = $true
$SMTPClient = New-Object Net.Mail.SmtpClient($SmtpServer, 25)
$SMTPClient.Send($mailmessage)
$attachment.dispose()
}

#Current Month
$currentDate = Get-Date
$localDateFormats = New-Object System.Globalization.DateTimeFormatInfo
$monthNames = $localDateFormats.MonthNames
#Generate first day of last month as a date
$startDate = $currentDate.AddMonths(-1).addDays(-$currentDate.AddMonths(-1).day+1)

#Generate last day of last month as a date
$endDate = $currentDate.AddDays(-$currentDate.day)

#Set the initial parameters
$inputFolder = "c:\temp\www.snap"
$logName = "SNAP"
$outputFile = "LogAnalysis_"+$logName+"_"+$startDate.year+$monthNames[$startDate.month-1]+".csv"
$outputPath = "C:\Users\daniel.cooper\Desktop\"

#RunLogAnalysis picks up the script-scope variables set above
RunLogAnalysis
[/sourcecode]

What’s happening here is that RunLogAnalysis() is the main controller function.  What it does is build the command line to run the queryLogs.ps1 script mentioned in the previous post, wait for it to finish and then email the result off.  We have another function, ConvertDateToW3C, which takes a date and converts it to the W3C format (yyyy-MM-dd) that LogParser.exe likes.  sendEmail() is pretty straightforward; it’s a generic email-sending function.

After the functions we have a little code to set up parameters.  My task was to email the client the previous month’s logs of PDF downloads on the first of each month.  To do this we get last month’s name (for the output filename), the date of the first day of last month and the date of the last day of last month.

After parameter generation is done, we perform the log analysis and email the result.  Set this up as a scheduled task on the webserver and we’re done.
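
For completeness, here’s roughly how that could be registered with schtasks.exe; the task name, script path and start time below are placeholders rather than what’s actually on our server.

[sourcecode language="powershell"]
#Run the analysis script at 06:00 on the 1st of every month (task name and path are examples)
schtasks /create /tn "eLibraryLogAnalysis" `
 /tr "powershell.exe -NoProfile -File C:\scripts\GenerateLogAnalysis.ps1" `
 /sc monthly /d 1 /st 06:00 /ru SYSTEM
[/sourcecode]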

Weblog Madness!

In these days of Google Analytics it’s a bit passé to talk about boring old web server logs.  But there are still good reasons to go diving into the big text files generated by IIS or Apache.  In my poorly paid and humiliating day job I was recently asked to find out how popular our PDF publications were.  The trouble is that, being a traditional-style organisation, most staff members think of the internet as an email medium and send links to PDFs via email ‘blasts’.  Downloads made this way can’t be picked up by the standard Google Analytics JavaScript-based tags.  We have a central library of publications and I made landing pages, but that’s a bit like closing the gate after the horse has bolted: what about last year’s traffic?

The only true answer is to go and look at the actual logs of what files were served to whom and when.  The data’s all in there!  There are only two problems:

  1. Those are some big-ass files to filter
  2. Lots of downloads are by spiders, rather than people.

Problem #1 is pretty easy to fix.  Microsoft provides a command-line tool as part of its IIS5 administrator’s toolkit (you can Google that) which will let you run SQL-like queries against W3C format logfiles (and lots of other log formats).  Problem #2 is a bit more work.  Using the user-agent header of an HTTP request we can spot the spiders and filter them out, but there are a great many of them!  Building the WHERE clauses for the query is a major effort and you risk missing a bracket or comma somewhere.
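
To give a feel for Problem #1, here’s what a one-off LogParser query looks like typed by hand (the log folder and output file are just examples); the script below exists so you don’t have to keep doing this:

[sourcecode language="powershell"]
#One-off query: every successful PDF download, written to a CSV file
LogParser.exe -i:IISW3C -o:CSV "SELECT date, cs-uri-stem, c-ip, cs(User-Agent) INTO pdfhits.csv FROM C:\temp\logs\*.log WHERE sc-status = 200 AND INDEX_OF(cs-uri-stem, '.pdf') > 0"
[/sourcecode]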

The solution, as to all of life’s big problems, is automation or, more specifically, scripts.  PowerShell this time…

[sourcecode language="powershell"]
param([string]$inputFolder = "none", [string]$outputFile = "none", [string]$startDate = "none", [string]$endDate = "none", [string]$keyword = "none")

#Query Logs is a wrapper for LogParser.exe which allows SQL-like queries to logfiles
#It is set to query IIS logfiles for PDF downloads, to filter out web spiders and output in a useful format
#With parameters it can output a date range and filter on a keyword within the PDF filename.

function output-help(){

"USAGE: .\queryLogs.ps1 -inputfolder xxx -outputfile xxx [-startdate xxx] [-enddate xxx] [-keyword xxx] "

}

switch("none"){

$outputFile {
"No output file specified!"
output-help
exit
}
$inputFolder {
"No input folder specified!"
output-help
exit
}

}

function buildRobotExcludeStatement([string]$botname){

"INDEX_OF(TO_LOWERCASE(cs(User-Agent)), TO_LOWERCASE('$botname')) = null"

}

$logparserLocation = "LogParser.exe"

$selectStatement = "SELECT date, cs-uri-stem, cs-uri-query, c-ip, cs(User-Agent)"
#LogParser wants the INTO clause before FROM: SELECT ... INTO output.csv FROM input.log
$fromStatement = "INTO $outputFile FROM $inputFolder\*.log"

$whereStatement = "WHERE sc-status = 200 AND cs-method = 'GET' AND INDEX_OF(cs-uri-stem, '.pdf') > 0"

if($startDate -ne "none"){

$whereStatement = "$whereStatement AND date >= '$startDate'"

}

if($endDate -ne "none"){

$whereStatement = "$whereStatement AND date <= '$endDate'"

}

#apply the optional keyword filter to the requested path (e.g. 'elibrary')
if($keyword -ne "none"){

$whereStatement = "$whereStatement AND INDEX_OF(TO_LOWERCASE(cs-uri-stem), TO_LOWERCASE('$keyword')) > 0"

}

$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement "http:")
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("robot"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("xenu"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("vse"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("urlchecker"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("TimKimSearch"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("Jakarta+Commons"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("bot"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("spider"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("yandex"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("Xerka"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("www"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("crawler"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("HttpComponents"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("leecher"))

$parameters = "-i:IISW3C `"$selectStatement $fromStatement $whereStatement`" -o:csv"
$command = "$logparserLocation $parameters"
$command
"(running)"
iex $command
[/sourcecode]

You can cut that code out and take it to the bank.  What it does is construct the command line for LogParser.exe and set it on its way.  You can give it the start and end date, the folder your *.log files are in, the name of an output file (it writes *.csv) and even a keyword to match in the request URI path.  Your final output is a CSV file of all the PDFs (that’s hardcoded for the moment) that have been downloaded for, say, 2011.  In Excel you can perform some easy data analysis (pivot tables are a must) and find the true number of PDF downloads from your site.
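
For reference, a typical run looks something like this (folder and file names are examples):

[sourcecode language="powershell"]
#All PDF downloads for 2011 whose path contains 'elibrary'
.\queryLogs.ps1 -inputFolder "c:\temp\www.snap" -outputFile "LogAnalysis_2011.csv" -startDate "2011-01-01" -endDate "2011-12-31" -keyword "elibrary"
[/sourcecode]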

If you run this, check the user agents in the output and add any robots you find to the WHERE clause builder.  The signatures included are just the ones we get.  It’s handy to compare this with your Google Analytics traffic to see what’s not getting recorded.
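
A quick way to audit which user agents are hitting your PDFs is another LogParser query, something along these lines (again, the log folder is an example):

[sourcecode language="powershell"]
#Count PDF requests per user agent so new spiders stand out
LogParser.exe -i:IISW3C -o:CSV "SELECT cs(User-Agent), COUNT(*) AS Hits INTO agents.csv FROM C:\temp\logs\*.log WHERE INDEX_OF(cs-uri-stem, '.pdf') > 0 GROUP BY cs(User-Agent) ORDER BY Hits DESC"
[/sourcecode]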

Of course a smarter way to do this would be to build an ISAPI filter that listens for PDF requests and fires off an event to Google Analytics to record each download, but that seems like a lot of work given that my colleagues don’t have access to GA, wouldn’t know what to do with it if they did, and would probably prefer the data in Excel format anyway.

Your client does not support opening this list with Windows Explorer

You’re not supposed to see the above error message unless you’re browsing your SharePoint site from a server, or from XP SP2, Vista or IE6.

If you’re on Windows 7 and SharePoint 2010 like me and you get this then you have a problem with your connection to the server, probably at the authentication layer.

The easiest way to fix it is:

  • Click Start
  • Right-Click ‘Computer’
  • Select ‘Manage’
  • Expand ‘Services & Applications’
  • Click ‘Services’
  • Find the WebClient service and restart it (or restart it from PowerShell, as below).
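
If you’d rather skip the clicking, the same restart can be done from an elevated PowerShell prompt:

[sourcecode language="powershell"]
#Restart the WebClient (WebDAV redirector) service
Restart-Service -Name WebClient
[/sourcecode]
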
If it works after that then it’s just a hiccup.  If it still won’t work you need to start googling for WebDAV ports and suchlike.

Migrating to SharePoint 2010

Upgrades can be a titanic pain, and a platform with as many moving parts as SharePoint means you’re in for a lot of headaches.

If you’re upgrading Office or even Windows, it’s usually just a matter of sticking a DVD in your machine, hitting OK a few times and going for a coffee.

The first problem with upgrading to SharePoint 2010 is its requirements: it has to run on 64-bit Windows Server 2008, so you may find yourself upgrading the OS first.

I’ve got a good idea!

Because of this, the upgrade task seems like a good opportunity to upgrade your hardware as well.  For example, we moved our two farms onto a single, virtualised farm.

The trouble starts at the planning stage.  If you’re moving from an old farm to a new one, you’re not upgrading, you’re migrating, and pretty much all the support out there is for upgrades.

Two Paths to Follow

SharePoint 2010 gives you two options for upgrading: a database attach upgrade or an in-place upgrade.  We’re doing a database attach because upgrading production servers (which are a mess) in place, into the unknown, sounds like a lot of weekends spent in the office.

With a database attach upgrade you back up a content database from your 2007 farm, move it to your new database server, create a receiving web application on your target farm and then go into PowerShell to mount the database into the application you just made.  This grinds away for a while, after which you have your old website on your new server.  Great!  It can even look the same, as you can pull the 2007 master pages along, or you can elect to upgrade the ‘user experience’ during the database attach.
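
The PowerShell end of that is only a couple of cmdlets.  Here’s a sketch (the database, server and URL names are invented); running Test-SPContentDatabase first will warn you about anything missing on the new farm:

[sourcecode language="powershell"]
#Report missing features, site definitions and customisations before attaching
Test-SPContentDatabase -Name WSS_Content_Intranet -DatabaseServer SQL01 -WebApplication http://newfarm

#Attach (and upgrade) the 2007 content database into the receiving web application
Mount-SPContentDatabase -Name WSS_Content_Intranet -DatabaseServer SQL01 -WebApplication http://newfarm
[/sourcecode]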

And then the Trouble Began

So this is all good (as long as you don’t have any custom code, solutions or files on the old farm that aren’t on the new one), provided you’re happy with the new farm where it is, as it is.

We have two 2007 farms, one for intranet and one for extranet.  We’re merging them and redesigning the main site, along with lots of other changes, so we’re moving sites and lists out of the upgraded web application into a new application/site collection.

Now, within a single site collection you can use the Content and Structure tool to move sites, lists and items about.  But if you want to move something between site collections, let alone web applications, it’s a bit trickier.

Powershell to the Rescue

If you can’t do PowerShell, you can’t manage SharePoint 2010; it just can’t be done.  Now, there should be a command to move a site or list, right?  Something like Move-SPWeb or some such?

I’m afraid not.  You can get some fancy-pants software to do that, but it costs an arm and a leg, one testicle and a handful of teeth, particularly if you have a lot of material to move or lots of servers in your farm.  Plus you have to install proprietary APIs.

So you have to use PowerShell; specifically, the Export-SPWeb and Import-SPWeb cmdlets.
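
A bare-bones example of the pair, with made-up URLs and file names:

[sourcecode language="powershell"]
#Export a subsite from the upgraded web application, keeping version history and permissions
Export-SPWeb -Identity http://intranet/projects -Path C:\Exports\projects.cmp -IncludeVersions All -IncludeUserSecurity

#Import it into a site that already exists in the target site collection
Import-SPWeb -Identity http://portal/sites/newhome/projects -Path C:\Exports\projects.cmp -IncludeUserSecurity
[/sourcecode]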

The Import/Export Obstacle Course

Here’s the first problem: you can’t just import into an empty URL.  You have to create a site on your target site collection, then perform the import.  OK, that’s not a problem, I’ll just create a blank site and import into that.

Experienced SharePoint administrators will immediately see a problem with that.  The site you create to import your material into has to use the same site template as the one you’re exporting from.  As a bonus, you can’t easily tell beforehand what template a site was made with (can anyone correct me on this?).  Luckily the import function will tell you what template you need in its error message.  God help you if your original site was some crazy template that doesn’t exist on your new farm.
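
For what it’s worth, the SPWeb object does seem to expose its template, so something like this may save a round of trial and error; the URLs are placeholders and STS#0 (the standard Team Site) is just an example:

[sourcecode language="powershell"]
#Peek at the source site's template name and configuration number before exporting
$web = Get-SPWeb http://intranet/projects
"{0}#{1}" -f $web.WebTemplate, $web.Configuration
$web.Dispose()

#Create the receiving site with a matching template, then import into it
New-SPWeb -Url http://portal/sites/newhome/projects -Template "STS#0" -Name "Projects"
Import-SPWeb -Identity http://portal/sites/newhome/projects -Path C:\Exports\projects.cmp
[/sourcecode]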

The other irritant is that you end up with a bunch of blank and duplicate lists.  I hate clutter in SharePoint so it’s a fair bit of work to clean this up.

If my only Tool was a Hammer

Being a lazy soul, if I find myself doing something more than once I’ll look for a way to automate it.  It’s lucky I work with computers, hey?  Since I’m punching all these commands into the PowerShell console, I may as well just save them to a file.  That’s what I’ve done and it seems to work OK now.  I shall post the completed script in my next post.