A Clockwork Noodle

Make such knaveries yours!

Clickstream tracking with Apache


I've embarked on the wearying task of identifying users on our company's website. It's a pretty bad website, granted, though at least I still have the excuse of not having designed it. "Not mine," I say, fifty times a day. Still, that can only work for so long before you start to hear the "what the hell does that guy do all day anyway" spiel on the breeze.

We have tons of (what I perceive as) problems with our site. First, and foremost in my mind, is that we have not separated out the various parts of the site. Everything is jumbled together like a stew. This makes it extremely difficult for me to extract any useful information. "Hey! We served 250,000 pages last month!" Sounds great, good job and all that. "Hey! 240,000 of the pages we served were to vendors, spiders and hackers!" Ooh, ouch, let's just sweep that business under the carpet.

The first thing I will change about our site in the big redesign will be to turn the "home page" into a "portal." That just means I won't have much on the home page that is marketing-centric, or anything -centric for that matter. The first page should funnel users into the appropriate section. Any analysis tools I use or develop can much more easily filter out the junk I don't care about.

So back to the present… our content manager is Mambo – a fine CMS for what we do with it. Yet, it is sorely lacking in the user tracking department. Particularly when it comes to the granular level of detail that the corner office is interested in. (I will spare you the psychology of that situation.) On top of that we use PHPlist to send email alerts to our clients when we update the news, up to several times a day. (Not mine!) The boss insists we should be able to capture email addresses when users visit our site. "Why not just grab their addresses, phone numbers and credit cards while we're at it," I gripe to whoever will listen.

However, it is entirely in the realm of possibility to track links sent via email by using a "turnstile" approach. All links in the email are directed to a script which translates the funny code into the "real" link, while grabbing any data we may have included, such as the user ID. Since we know the email address (we sent them the mail, duh) we can tag the users (6079 Smith W) and watch every move they make!

Apache includes a number of handy modules to facilitate this approach. One is mod_rewrite. This module allows you to change the text in the URL you've received to whatever you want. My turnstile script wants two pieces of information: the link ID and the user ID. I store the link ID in a table with a little front-end to assign IDs to whatever URL I paste in. Then I can send out this link:

http://mysite.com/x/abc123/aabbcc112233

Then using mod_rewrite in my .htaccess file, as such:

RewriteEngine On 
RewriteBase / 
# just in case...
RewriteRule ^x$ /x/ [R]
# closing / to bookend our data
RewriteRule ^x/(.+)$ /x/?qs=$1/

I turn it into a querystring which I can parse out.

http://mysite.com/x/?qs=abc123/aabbcc112233/

<?php $stuff = explode("/",$_GET['qs']) ...

(Leave a comment if you want all the code.)
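For the curious, here is a minimal sketch of the turnstile logic. It is written in Python rather than the PHP I actually run, so treat it as an illustration and not my real script. The LINKS dictionary is a made-up stand-in for the link table, and the function names are invented for this example.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical stand-in for the link table (the real one lives in a database).
LINKS = {"abc123": "http://mysite.com/news/"}


def tracked_link(link_id, user_id, base="http://mysite.com/x/"):
    """Build the URL that goes into the email: base + link ID + user ID."""
    return f"{base}{link_id}/{user_id}"


def parse_turnstile(qs_value):
    """Split the rewritten qs value into (link_id, user_id).

    The rewrite rule appends a closing slash, so empty pieces are dropped.
    """
    parts = [piece for piece in qs_value.split("/") if piece]
    if len(parts) != 2:
        raise ValueError(f"expected link ID and user ID, got {qs_value!r}")
    return parts[0], parts[1]


def resolve(rewritten_url, links=LINKS):
    """Return (real_url, user_id) for a rewritten turnstile URL.

    Raises KeyError if qs is missing or the link ID is not in the table.
    """
    query = parse_qs(urlsplit(rewritten_url).query)
    link_id, user_id = parse_turnstile(query["qs"][0])
    return links[link_id], user_id


if __name__ == "__main__":
    print(tracked_link("abc123", "aabbcc112233"))
    print(resolve("http://mysite.com/x/?qs=abc123/aabbcc112233/"))
```

A real script would record the user ID against the visitor's cookie and then redirect to the real URL. That part depends on your own tables, so I have left it out.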

The mod_usertrack module is another handy module for Apache. It just gets/sets a cookie for all of your users as they pass through. By combining this with a custom log in your Apache httpd.conf, voila! You've got a clickstream log.

CookieTracking on 
CookieName mysiteUser 
CookieDomain .mysite.com 
# 1 year, in seconds
CookieExpires 31536000
# don't log images, Flash, scripts, stylesheets or icons
SetEnvIfNoCase Request_URI "\.(gif|png|jpg|swf|js|css|ico)$" dontlog
CustomLog /www/logs/clickstream "%{cookie}n %{[%F %T]}t %U%q" env=!dontlog

So, now Apache automatically tracks your users, and you can link email to ApacheID. Now, all you need to do is process the clickstream logs every day to keep your server from blowing up. Maybe I'll go over that next time.
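Until then, here is a rough sketch of what the daily processing might look like, in Python. It assumes each log line has exactly the shape produced by the CustomLog format above: cookie value, bracketed timestamp, then URL path plus query string. The function names are mine, and lines with "-" in the cookie column (no cookie yet) are simply skipped.

```python
import re
from collections import defaultdict
from datetime import datetime

# One line of the clickstream log: cookie, [YYYY-MM-DD HH:MM:SS], URL.
LINE_RE = re.compile(r"^(?P<cookie>\S+) \[(?P<when>[^\]]+)\] (?P<url>\S+)$")


def parse_line(line):
    """Return (cookie, timestamp, url), or None if the line does not parse."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    try:
        when = datetime.strptime(match.group("when"), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return match.group("cookie"), when, match.group("url")


def clickstreams(lines):
    """Group hits by cookie, each group sorted by time."""
    streams = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or parsed[0] == "-":
            continue  # unparseable line, or a visitor with no cookie
        cookie, when, url = parsed
        streams[cookie].append((when, url))
    for hits in streams.values():
        hits.sort()
    return dict(streams)


if __name__ == "__main__":
    # The path comes from the CustomLog line above.
    with open("/www/logs/clickstream") as log:
        for cookie, hits in clickstreams(log).items():
            print(cookie, [url for _, url in hits])
```

From there, any hit on the turnstile URL gives you the user ID to attach to that cookie.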

Written by greensweater

2006-05-01 at 09:05

Posted in General
