News Flash! Hosting Update

This blog started I don’t know when on some cheap GoDaddy hosting back in the ADSL days. They sent me a bill for $50 for another year’s hosting and I thought that was just too much, so I backed up the site and let it fall offline.

Much later, for a lark, I took a surplus Eee from my employer, put Ubuntu on it and stuck it in the server room. It worked! It worked surprisingly well; Nginx running on Linux is no joke. But all good things come to an end and the poor bastard died.

At work we had a legacy account at HostGator (a ‘baby’ package) where the main site used to live. It’s very cheap, $50 a year, ha ha, so it’s been less bother to keep it than to move the few sites that were still there. It has ‘unlimited’ storage, so there would be no trouble keeping this site, sachidan.com and bartripping.com there.

The only trouble is that HostGator is complete bullshit. The hosting is so horrifically underpowered that uploading any image larger than 1MB (and sometimes smaller) to WordPress would cause the instance to crumble under the load. The little Eee that the site used to be on was Concorde by comparison.

Well, enough of this nonsense, but I don’t like paying $3/month for hosting. It turns out that one of our new company websites is going to be on WordPress so I’ve been marshalling a couple of our DigitalOcean droplets into serving WordPress and since the dev server isn’t too busy anyway… The site has a new home. Regular readers (hi Dad) will notice a swift increase in performance and I won’t have to turn my photos into postage stamps to get them served.

DigitalOcean does have an out-of-the-box WordPress instance that’s actually pretty good but it runs on Apache and I’m completely over Apache. Hey, it’s a good webserver and WordPress is the killer app for Apache on PHP but, even though I find nginx server blocks occasionally infuriating, if I never have to edit an .htaccess file again I will die a happy man.

So sorry it’s been quiet lately (for years, really), but it’s been no fun on crap hosting, even though this site is all about crap hosting. Sooner or later I’ll buy a Raspberry Pi to host on, but until then you’ll all have to get by with the cloud.

Take a Walk through your SharePoint Farm

When tasked with upgrading our long-neglected ‘intranet’ last year, my first job was to work out just how much data was out there and what needed to be upgraded.

The masterpage had been butchered some time in the past, so most of the pages were missing navigation, making it hard to follow sites down the hierarchy. And what a hierarchy! The architects of the original instance apparently never worked out that you could have more than one document library per site, or that you could create folders. The result is the typical site sprawl. To add to the fun, some sites were created using a custom template that no longer works and others didn’t have any files in them at all.

In order to create a list of all the sites and how they relate, you can use a PowerShell script:

[code lang="PowerShell"]

[System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint") > $null

function global:Get-SPSite($url){
    return New-Object Microsoft.SharePoint.SPSite($url)
}

function Enumerate-Sites($website){
    foreach($web in $website.GetSubwebsForCurrentUser()){
        #Output GUID;parent GUID;title;URL for this subsite, then recurse into it
        [string]$web.ID+";"+[string]$web.ParentWeb.ID+";"+$web.Title+";"+$web.Url
        Enumerate-Sites $web
        $web.Dispose()
    }
    $website.Dispose()
}

#Change this variable to your site collection URL
$siteCollection = Get-SPSite("http://example.org")

$start = $siteCollection.RootWeb

Enumerate-Sites $start
$start.Dispose()
[/code]

It’s actually pretty simple, since we take advantage of recursion. It’s pretty much a matter of getting a handle to the site collection, outputting its GUID, its parent site’s GUID and then the human-readable title and URL. Then you do the same for each subsite, and so on down the branch.
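
If you save the script as, say, Enumerate-Sites.ps1 (a filename I’ve made up for this example), you could capture the semicolon-delimited output as a proper CSV, ready to feed to Visio, with something like this sketch; the column headers and output path are assumptions too:

[code lang="PowerShell"]
#Run the enumeration script and turn its semicolon-delimited output into a CSV with named columns
#(a sketch: the script filename, column headers and output path are all assumptions)
.\Enumerate-Sites.ps1 |
    ConvertFrom-Csv -Delimiter ';' -Header 'SiteID','ParentID','Title','Url' |
    Export-Csv -Path .\sites.csv -NoTypeInformation
[/code]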

The reason we’re outputting the GUIDs is that we can use them to relate each site to its parent. The script outputs straight to the console, but if you pipe the output to a text file you can use it as the input to an org-chart diagram in Visio. The results are terrifying:

[Visio org chart of the resulting site hierarchy]

Each node on that diagram is a site that may have one document, or thousands. Or nothing. Or the site just may not work. As it turned out, when asked to prioritize material for migration, the stakeholders decided it would be easier just to move the day-to-day stuff and leave the old farm going as an ‘archive’. Nothing like asking a client to do work to get your scope reduced!

As a final note on this script: it is recursive, so it could (according to my first-year Comp. Sci. lecturer) theoretically balloon out of control and consume all the resources in the visible universe before collapsing into an ultradense black hole and crashing your server in the process. But you’d have to have a very elaborate tree structure for that to happen, in which case you’d probably want to partition it off into separate site collections anyway.

More on Web Log Analysis

In my previous post on web log analysis, I described a PowerShell wrapper script for LogParser.exe, which lets you run SQL-style queries against text logfiles. Today I have another script which wraps that script and runs as a scheduled task to send the filtered logs to the client each month.

[sourcecode language="powershell"]
#GenerateLogAnalysis will query IIS logfiles and output logs for PDF downloads from the first until
#the last day of the previous month

#function that performs the actual analysis
function RunLogAnalysis(){

$command = "c:\users\daniel.cooper\desktop\scripts\queryLogs.ps1 -inputFolder {0} -outputFile {1} -startDate {2} -endDate {3} -keyword {4}" -f $inputFolder, ($outputPath+$outputFile), (ConvertDateToW3C($startDate)), (ConvertDateToW3C($endDate)), "elibrary"
$command
invoke-expression $command

$emailBody = "<div style=""font-family:Trebuchet MS, Arial, sans-serif;""><img src=""http://www.undp.org/images/cms/global/undp_logo.gif"" border=""0"" align=""right""/><h3 style=""color:#003399;"">Log Analysis</h3>A log anaylsis has been run on the eLibrary for PDF files for "+$monthNames[$startDate.month-1]+" "+$startDate.Year+"<br/>Please find it attached."

sendEmail "recipient@example.org" "sender@example.org" "eLibrary Log Analysis: $outputFile" ($outputPath+$outputFile) $emailBody
}

function ConvertDateToW3C($dateToBeConverted){

return "{0:D4}-{1:D2}-{2:d2}" -f $dateToBeConverted.year, $dateToBeConverted.month, $dateToBeConverted.day;

}

function sendEmail($toAddress, $fromAddress, $subject, $attachmentPath, $body){

$SMTPServer = "yourMailServer"

$mailmessage = New-Object system.net.mail.mailmessage
$mailmessage.from = ($fromAddress)
$mailmessage.To.add($toAddress)
$mailmessage.Subject = $subject
$mailmessage.Body = $body

$attachment = New-Object System.Net.Mail.Attachment($attachmentPath, 'text/plain')
$mailmessage.Attachments.Add($attachment)

$mailmessage.IsBodyHTML = $true
$SMTPClient = New-Object Net.Mail.SmtpClient($SmtpServer, 25)
$SMTPClient.Send($mailmessage)
$attachment.dispose()
}

#Current Month
$currentDate = Get-Date
$localDateFormats = New-Object System.Globalization.DateTimeFormatInfo
$monthNames = $localDateFormats.MonthNames
#Generate first day of last month as a date
$startDate = $currentDate.AddMonths(-1).addDays(-$currentDate.AddMonths(-1).day+1)

#Generate last day of last month as a date
$endDate = $currentDate.AddDays(-$currentDate.day)

#Set the initial parameters
$inputFolder = "c:\temp\www.snap"
$logName = "SNAP"
$outputFile = "LogAnalysis_"+$logName+"_"+$startDate.year+$monthNames[$startDate.month-1]+".csv"
$outputPath = "C:\Users\daniel.cooper\Desktop\"

#RunLogAnalysis picks up the script-scope variables set above, so it takes no arguments
RunLogAnalysis
[/sourcecode]

What’s happening here is that RunLogAnalysis() is the main controller function. What it does is set up the command to run the queryLogs.ps1 script mentioned in the previous post, wait until it has run and then email the result off. We have another function, ConvertDateToW3C, which takes a date and converts it to W3C format (yyyy-MM-dd), which is what LogParser.exe likes. sendEmail() is pretty straightforward; it’s a generic email-sending function.

After the functions we have a little code to set up parameters. My task was to email the client the previous month’s logs for PDF downloads on the first of each month. To do this we get last month’s name (for the output filename), the date of the first day of last month and the date of the last day of last month.
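
The AddMonths/AddDays arithmetic above works, but it’s a little dense. An equivalent way to get the same two dates (just a sketch, not what the script uses) would be to start from the first of the current month:

[sourcecode language="powershell"]
#Sketch: first and last day of the previous month, derived from the first of this month
$today = Get-Date
$firstOfThisMonth = Get-Date -Year $today.Year -Month $today.Month -Day 1
$startDate = $firstOfThisMonth.AddMonths(-1)  #first day of last month
$endDate = $firstOfThisMonth.AddDays(-1)      #last day of last month
[/sourcecode]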

After parameter generation is done, we perform the log analysis and email the result.  This is created as a scheduled task on the webserver and we’re done.
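
For the record, registering the script as a monthly scheduled task can be done from a command prompt with something along these lines; the script path, task name and start time here are placeholders, not what I actually use:

[sourcecode language="powershell"]
#Run the analysis at 06:00 on the first of every month (path, task name and time are placeholders)
schtasks /Create /SC MONTHLY /D 1 /ST 06:00 /TN "eLibrary Log Analysis" /TR "powershell.exe -File C:\scripts\GenerateLogAnalysis.ps1"
[/sourcecode]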

Weblog Madness!

In these days of Google Analytics it’s a bit passé to talk about boring old web server logs. But there are still good reasons to go diving into the big text files generated by IIS or Apache. In my poorly paid and humiliating day job I was recently asked to find out how popular our PDF publications were. The trouble is that, being a traditional-style organisation, most staff members think of the internet as an email medium and send links to PDFs via email ‘blasts’. Downloads that happen this way can’t be picked up by the standard Google Analytics JavaScript-based tags. We have a central library of publications and I made landing pages, but that’s a bit like closing the gate after the horse has bolted; what about last year’s traffic?

The only true answer is to go and look at the actual logs of what files were served to whom and when. The data’s all in there! There are only two problems:

  1. Those are some big-ass files to filter
  2. Lots of downloads are by spiders, rather than people.

Problem #1 is pretty easy to fix. Microsoft provides a command-line tool as part of its IIS5 administrator’s toolkit (you can Google that) which will let you run SQL-like queries against W3C-format logfiles (and lots of other log formats). Problem #2 is a bit more work. Using the user-agent header of an HTTP request we can spot the spiders and filter them out, but there are a great many of them! Building the WHERE clauses for the query is a major effort and you risk missing a bracket or comma somewhere.
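
To give a flavour of the tool before wrapping it in a script, here’s a one-off query you could run straight from the console, assuming LogParser.exe is on your path and your logs live in c:\temp\logs (both assumptions); it counts successful PDF hits per day:

[sourcecode language="powershell"]
#A one-off LogParser query: successful PDF downloads per day from a folder of IIS W3C logs
#(LogParser.exe on the path and the log folder location are assumptions)
& LogParser.exe -i:IISW3C "SELECT date, COUNT(*) AS hits FROM c:\temp\logs\*.log WHERE sc-status = 200 AND INDEX_OF(cs-uri-stem, '.pdf') > 0 GROUP BY date" -o:csv
[/sourcecode]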

The solution, as to all of life’s big problems, is automation or, more specifically, scripts. PowerShell this time…

[sourcecode language="powershell"]
param([string]$inputFolder = "none", [string]$outputFile = "none", [string]$startDate = "none", [string]$endDate = "none", [string]$keyword = "none")

#Query Logs is a wrapper for LogParser.exe which allows SQL-like queries to logfiles
#It is set to query IIS logfiles for PDF downloads, to filter out web spiders and output in a useful format
#With parameters it can output a date range and filter on a keyword within the PDF filename.

function output-help(){

"USAGE: .\queryLogs.ps1 -inputfolder xxx -outputfile xxx [-startdate xxx] [-enddate xxx] [-keyword xxx] "

}

#If either mandatory parameter is still at its "none" default, print usage and exit
switch("none"){

$outputFile {
"No output file specified!"
output-help
exit
}
$inputFolder {
"No input folder specified!"
output-help
exit
}

}

function buildRobotExcludeStatement([string]$botname){

"INDEX_OF(TO_LOWERCASE(cs(User-Agent)), TO_LOWERCASE('$botname')) = null"

}

$logparserLocation = "LogParser.exe"

$selectStatement = "SELECT date, cs-uri-stem, cs-uri-query, c-ip, cs(User-Agent)"
$fromStatement = "FROM $inputFolder\*.log TO $outputFile"

$whereStatement = "WHERE sc-status = 200 AND cs-method = 'GET' AND INDEX_OF(cs-uri-stem, '.pdf') > 0"

if($startDate -ne "none"){

$whereStatement = "$whereStatement AND date >= '$startDate'"

}

if($endDate -ne "none"){

$whereStatement = "$whereStatement AND date0"

}

$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement "http:")
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("robot"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("xenu"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("vse"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("urlchecker"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("TimKimSearch"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("Jakarta+Commons"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("bot"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("spider"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("yandex"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("Xerka"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("www"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("crawler"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("HttpComponents"))
$whereStatement = "{0} AND {1}" -f $whereStatement, (buildRobotExcludeStatement("leecher"))

$parameters = "-i:IISW3C `"$selectStatement $fromStatement $whereStatement`" -o:csv"
$command = "$logparserLocation $parameters"
$command
"(running)"
iex $command
[/sourcecode]

You can cut that code out and take it to the bank. What it does is construct the command line for logparser.exe and set it on its way. You can give it a start and end date, the folder your *.log files are in, the name of an output logfile (it outputs *.csv) and even a keyword to match in the request URI path. Your final output is a CSV file of all the PDFs (the .pdf filter is hardcoded for the moment) that have been downloaded for, say, 2011. In Excel you can perform some easy data analysis (pivot tables are a must) and find the true number of downloads of PDF documents from your site.
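
If you’d rather skip the Excel step, LogParser can do the aggregation itself. This is just a sketch with an assumed log folder, and it leaves out the robot exclusions from the script above:

[sourcecode language="powershell"]
#Sketch: downloads per PDF, aggregated by LogParser rather than Excel (log folder is an assumption)
& LogParser.exe -i:IISW3C "SELECT cs-uri-stem, COUNT(*) AS downloads FROM c:\temp\logs\*.log WHERE sc-status = 200 AND INDEX_OF(cs-uri-stem, '.pdf') > 0 GROUP BY cs-uri-stem ORDER BY downloads DESC" -o:csv
[/sourcecode]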

If you run this, check the user-agents in the output and add any robots you find to the WHERE clause builder. The signatures included are just the ones we get. It’s handy to compare this with your Google Analytics traffic to see what’s not getting recorded.
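
One way to spot newcomers is to list the distinct user-agents that are fetching PDFs and eyeball them; again, the log folder here is an assumption:

[sourcecode language="powershell"]
#Sketch: list distinct user-agents requesting PDFs so new robots can be added to the exclusions
& LogParser.exe -i:IISW3C "SELECT DISTINCT cs(User-Agent) FROM c:\temp\logs\*.log WHERE INDEX_OF(cs-uri-stem, '.pdf') > 0" -o:csv
[/sourcecode]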

Of course, a smarter way to do this would be to build an ISAPI filter that listened for PDF requests and fired off an event to Google Analytics recording each download, but that seems like a lot of work given that my colleagues don’t have access to GA, wouldn’t know what to do with it if they did and would probably prefer the data in Excel format anyway.