Quick and dirty referer tracking

About once a month someone links to my blog and I get all excited to see how many clicks they drove. Despite 10+ years of JavaScript tracking systems like Google Analytics, I still don’t have a proper tool to just quickly show me “what’s driving traffic today?” I usually cobble together some hideous thing that starts awk '{ print $7, $11 }' | grep -v ... and then get annoyed. So today I wrote a proper Python script:

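#!/usr/bin/env python3
# Print "url referer" for each externally referred hit in the Apache logs.

import itertools, re

# Skip static assets and feeds, matched on the end of the request URL.
skipRE = re.compile(r'(js|png|gif|jpg|css|rss091|atom|ico)$')
def skip(url, referer):
    # Ignore asset hits, empty referers, and links from my own site.
    return skipRE.search(url) is not None or \
        referer == '-' or \
        referer.startswith('http://www.somebits.com')

# Read the rotated log first, then the current one.
paths = ("/var/log/apache2/access.log.1", "/var/log/apache2/access.log")
logs = [open(p, errors="replace") for p in paths]
for l in itertools.chain(*logs):
    d = l.split()
    if len(d) < 11:
        continue  # malformed or truncated line
    url = d[6]                   # request path (awk's $7)
    referer = d[10].strip('"')   # Referer field (awk's $11)
    if not skip(url, referer):
        print(url, referer)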

The output is then piped through a Python script I wrote years ago which counts up the most common lines.
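
That counting script isn't shown here. A minimal sketch of such a counter, assuming Python 3 and one "url referer" pair per line on stdin, might look like the following. The names count_lines and format_top are made up for illustration and are not from the original script.

```python
#!/usr/bin/env python3
# Hypothetical stand-in for the line counter; not the original script.
import sys
from collections import Counter

def count_lines(lines):
    """Return a Counter mapping each distinct non-blank line to its frequency."""
    return Counter(line.rstrip("\n") for line in lines if line.strip())

def format_top(counts, n=20):
    """Format the n most common lines as 'count line', most frequent first."""
    return ["%6d %s" % (count, line) for line, count in counts.most_common(n)]

# Only read stdin when something is actually piped in.
if __name__ == "__main__" and not sys.stdin.isatty():
    for row in format_top(count_lines(sys.stdin)):
        print(row)
```

Saved under some name of your choosing, it would sit at the end of the pipeline, reading the referer script's output and printing the 20 most frequent url/referer pairs with their counts.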

One remaining mystery: the large majority of my pageviews have a referer of -, or empty. I know folks aren’t pasting URLs in by hand, so where does this come from? Do redirectors somehow clear Referer?