True Geo-IP country geolocation for Google Urchin 5 statistics

Submitted by Hannes Schmidt on Thu, 12/15/2005 - 08:58.

If you are a user of Urchin 4 or 5 you might have noticed that Urchin’s “Countries” report does not match the standards that Urchin has set for professional website statistics.  It seems to be Urchin’s weak spot.  It might be ok for you but I was certainly disappointed when I discovered that Urchin determines (or should I rather say: attempts to determine) a visitor’s location from a database that is derived from reverse DNS lookups. This type of location reporting can hardly be considered state-of-the-art, even less so for a commercial website log file analytics application. This article is about how Urchin 5 can be fooled into using true Geo IP mappings for its Countries report. Ok, I admit this paragraph is stuffed with keyword for the search engines. But hey, you found it, didn't you?

Urchin analytics and reverse DNS lookups

Are you impatient or very occupied with more important things? Skip to section Installation Instructions!

A reverse DNS lookup is a query to the domain name system (DNS) that returns a domain name for an IP address. It’s called “reverse” because the DNS is commonly used to resolve domain names into IP addresses via forward mappings but not vice versa.  Some but not all internet providers pick a domain name that reflects the country of the IP addresses they assign to their customers. T-Com, for example, one of Germany’s bigger internet providers returns a reverse mapping that looks like p54BF0CBC.dip0.t-ipconnect.de. The "54BF0CBC" part is the hex-encoded IP address. But more importantly, the top-level domain (TLD) in this example is “de” which correctly identifies Deutschland (Germany) as the country location of the IP address. Reverse DNS resolution is optional, meaning that internet providers do not necessarily have to setup a reverse lookup record for their own IP ranges. As a consequence, not every IP can be resolved into a domain name. Furthermore, many providers all over the world choose to their reverse mappings in the “.net” or “.org” TLDs. Often, the forward and reverse mappings are not symmetric. For example, this site's domain hannesschmidt.de resolves to 217.160.186.48 for which the reverse mapping is not hannesschmidt.de but s15135970.rootmaster.info. In none of these cases can the IP's country location be concluded accurately. It's more like a guess. Experience with my own sites shows that 30% of all IP addresses do not have a reverse mapping and 20% map to generic TLDs like .net or .org.

Geo-IP/geolocation for website statistics

Nevertheless, Urchin 5 still uses reverse lookups although there are now much more accurate and reliable alternatives available. These alternatives are usually classified as Geo-IP, geo-targeting, geographical targeting or geo-IP-address mapping (more stuffing, I know). Most of them are available on a commercial basis only. This is because it is very labor-intensive to keep track of the physical location of every single assigned IP address range on this planet. However, some vendors of Geo-IP solutions release intentionally stripped-down versions of their databases for free. MaxMind, for example, offers its GeoLite Country Database free of charge. It also offers more accurate commercial databases that map IP addresses to regions, cities and even individual organizations.

Unlike most website statistic solutions (AWstats, for example), Urchin does not do the reverse-lookups while it processes the log files. Instead, every Urchin installation downloads a static database from the Urchin support website (now maintained by Google) on a monthly basis. When processing the log files, an Urchin installation looks up the reverse mappings in that database instead of issueing reverse queries to the DNS. This makes sense because the reverse mappings do not change that often and it reduces the server load. Although the domain database is downloaded from Google, giving them the opportunity to fill it with Geo-IP based data, they populate the database with the results from reverse DNS queries.

Fortunately, Urchin has reserved a loophole through which the database can be customized. In addition to records that are downloaded during the automatic monthly updates from Google, every Urchin installation’s database can also contain custom records. These custom records are imported from a file called domain.local. Each line in this file consists of an IP-address in quad-dotted notation (“12.16.38.120”) or a network address in the form “12.16.38.0/24” and a fully-qualified domain name. See “geo-update: DNS Database Update Utility” for details.

Faking it

This opens up an interesting possibility. Why not fake the entries in that database to map IP and network addresses into fictive domain names that have the Geo-IP country code as the top level domain? So for example, instead of

67.164.218.68     c-67-164-218-68.hsd1.ca.comcast.net

the entry could read

67.164.218.68     67-164-218-68.us

where "us" is the two-letter ISO country code for the United States. Urchin does not care what the actual domain name is, it only looks at the top-level domain (TLD). Fortunately, most country-specific top-level domains are identical to the countries' ISO codes. Note that historically only the U.S. had the privilege of assigning top-level domains such as .com or .net. Nowadays these TLDs can be assigned to anyone in the world and only some TLD are still restricted to the U.S., like .gov and .mil.

As it turns out, this is entirely possible. I wrote a little Perl script that converts MaxMind's free-of-charge CSV-formatted (comma-separated value) Geo-IP Lite database to the format required for Urchin’s domain.local database. The crucial aspect of this conversion is that MaxMind’s database is based on IP-ranges in the form <from-IP>-<to-IP> whereas Urchin only reads network addresses in the <IP>/<number-of-bits> format.

I find the script interesting from a scientific point of view as the algorithm that I came up with converts an arbitrary range of IP-addresses into a minimum number of non-overlapping network addresses that completely cover the input range. The script also demonstrates how to employ a binary search to find the highest-order 1 bit in a bit vector that is n bits wide in O(log n) time. I’m sure state-of-the-art processors have an instruction that does this in O(1) using their barrel shifters but unfortunately that instruction is not available as a Perl operator. Hmm, isn’t it interesting how we started with website statistics and ended up in low-level micro-processor design? Enough, enough – stop all this nerdy, self-appraising non-sense. Let's get down to business.

Installation Instructions

The following instructions apply to Urchin 5 for Windows but should also work for other platforms if adjusted accordingly. I am not sure if they can be used for Urchin 4 but I think it's worth a try. If anyone successfully implements this for Urchin 4, I would appreciate some feedback. Furthermore, I do not know whether my script works on 64-bit platforms as I don't have access to one. Again, feedback is highly appreciated.

  1. Download MaxMind’s GeoLite Country Database in zipped CSV format and unzip it. The only drawback of the free-of-charge database is that it does not accurately identify AOL users. It maps them all to the U.S. The commercial version resolves AOL IP’s to their correct country, so you might want to invest in that.
  2. If you’re using Windows, download and install ActiveState’s ActivePerl. On other platforms Perl is most likely already installed.
  3. Download my conversion script from the attachments section found at the end of this article and place it in in the same directory as the CSV file.
  4. Open a command prompt, change to the directory containing the CSV file and run the script like this:
  5. GeoIP2Urchin.pl < GeoIPCountryWhois.csv > domain.local
  6. Point you favorite browser (I hope it’s Firefox) to your Urchin installation’s administrative web interface.  Go to Configuration | Scheduler | Task Scheduler and disable the __domaindb task.  Don’t delete it, just specify Never for Frequency.
  7. You can (but you certainly don’t have to) install Perl and my script on the machine that runs Urchin (most likely your web server). Alternatively, you can run the script on you personal/development box and upload the resulting domain.local file to the Urchin machine. Whatever you do, the following steps need to be performed on the machine on which Urchin is installed.
  8. Move (I mean move, not copy) all files in C:\Program Files\Urchin\data\geodata to a backup directory out of Urchin’s reach.
  9. Put the domain.local file into the C:\Program Files\Urchin\data\geodata directory.
  10. Run
  11. C:\Program Files\Urchin\util\geo-update.exe –D

    This imports the entries in domain.local into Urchin’s binary domain database. The –D switch skips the downloading of database updates from Google’s Urchin support website. We don’t want any of the regular reverse-DNS records in our database, just the fake records from domain.local.

  12. Optional: From this point on, the visitor’s IP address in future log file entries will be mapped to their respective country using the true and accurate Geo-IP mapping derived from MaxMind’s database. If you would like to retroactively apply the new Geo-IO mapping to older log file records, you can delete the entire Urchin log database for a profile and run the profile manually. Beware that if the Urchin database contains data from log files that have already been deleted, you will loose statistics data! This is why I am not going to tell you how to perform this step. You ought to know what you’re doing.

That’s it, folks. Enjoy your much more comprehensive and accurate Urchin Countries report.

AttachmentSize
GeoIP2Urchin.zip2.07 KB
( categories: Unix | Windows | Webmaster )
Submitted by Anonymous on Wed, 05/13/2009 - 13:15.
the domain.local file is create but there is no data in it. Okb
Submitted by Anonymous on Sun, 09/07/2008 - 23:12.
Urchin is unavailable now. But we managed to use the previous version and import the IP2Location Geolocation database.
Submitted by Anonymous on Fri, 02/01/2008 - 07:07.
The stuff works just great ! kala.fior