Urchin 5 and dynamic URLs with query parameters/arguments

Submitted by Hannes Schmidt on Tue, 12/06/2005 - 16:14.

Urchin is a professional web log analysis and statistics application. It was recently acquired by Google and what used to be called Urchin 6 On Demand is now being integrated into Google Analytics. I don’t know how long the stand-alone Urchin 5 will be around, but right now it is still used by many individuals and corporations. I was not happy with the way Urchin deals with dynamic URLs, i.e. URLs that have query parameters in them. After playing around with Urchin 5's advanced filters for a while, I came to the conclusion that they can be employed to improve Urchin's dynamic URL handling.

Nice Try: The DynamicURL Filter

Urchin 4 simply strips query parameters from every URL. As this is obviously not what many people want, an advanced filter called DynamicURL can be used to match certain query parameters and map them to static-looking URLs. For example,

/catalog.php?prodID=12

can be mapped to

/catalog.php/prodID/12

by using a DynamicURL filter with the following regular expression

(/catalog\.php\?)(prodID=.*)

The DynamicURL filter compares every URL against a regular expression. The regular expression should contain at least one pair of parentheses. If it matches, the URL will be replaced by another URL that consists of the text matched by each pair of parentheses with a slash in between.

The DynamicURL filter isn’t a very elegant solution. It creates artificial URLs that cannot be clicked in the statistics page because they do not point to the original page. More importantly, they impose a certain order of query parameters. Unless a certain order can be guaranteed, one filter is needed for every possible permutation of parameters. This can easily get out of hand for more than two parameters. A URL with three parameters like

/script.php?a=1&b=2&c=3

for example, needs six filters:

(/script\.php\?)(a=.*)&(b=.*)&(c=.*)
(/script\.php\?)(a=.*)&(c=.*)&(b=.*)
(/script\.php\?)(b=.*)&(c=.*)&(a=.*)
(/script\.php\?)(b=.*)&(a=.*)&(c=.*)
(/script\.php\?)(c=.*)&(a=.*)&(b=.*)
(/script\.php\?)(c=.*)&(b=.*)&(a=.*)
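
The growth in the number of filters can be illustrated with a short Python sketch. It only generates the regular expressions listed above; the helper name dynamic_url_patterns is made up for this illustration and is not part of Urchin.

```python
from itertools import permutations
from math import factorial


def dynamic_url_patterns(script, params):
    """Return one DynamicURL-style regex per ordering of the parameters."""
    # Escape the dot in the script name and capture the stem plus "?".
    stem = "(/" + script.replace(".", r"\.") + r"\?)"
    return [
        stem + "&".join(f"({name}=.*)" for name in order)
        for order in permutations(params)
    ]


patterns = dynamic_url_patterns("script.php", ["a", "b", "c"])
for pattern in patterns:
    print(pattern)

# The number of filters needed grows factorially with the parameter count.
for n in range(1, 6):
    print(n, "parameters ->", factorial(n), "filters")
```

Three parameters already require six filters, four require 24 and five require 120.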

Using a single, more generic filter as in

(/script\.php\?)([a-z]+=.*)&([a-z]+=.*)&([a-z]+=.*)

only works if there are no additional parameters that should not be mapped. Either way, every permutation will generate a distinct report entry. For example,

/script.php?a=1&b=2&c=3

maps to

/script.php/a=1/b=2/c=3

whereas

/script.php?b=2&a=1&c=3

although semantically equivalent, maps to

/script.php/b=2/a=1/c=3

Two Different Kinds of Parameters

With version 5, Urchin goes to the other extreme by stripping all URL parameters and listing them in a separate report (Page Query Terms). In some situations, this might be the right thing to do. In other situations, especially with content management systems, it will most likely be undesirable. The key concept that Urchin fails to implement is that URL parameters fall into two sets. The first set of parameters identifies pages. The second set determines slight alterations to the content of these pages. The URL

/addToCart.php?catalog=1&product=2&session=654372392

for example, has three parameters. Two of those define a page: product and catalog. The session parameter holds the session id and is used to identify one particular browser session. The URL

/addToCart.php?product=2&catalog=1&session=654372392

points to the very same page. Consequently, both URLs should be listed in the statistics under the single entry

/addToCart.php?catalog=1&product=2

A web log analysis application should

  • allow the user to specify which parameters define pages. These parameters should be listed as part of the URL in the report; the others should be broken down into a separate report.
  • be insensitive to the order in which parameters occur in URLs.
  • produce reports in which entries can be clicked in order to get to the respective page.
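
The first two requirements can be sketched in a few lines of Python. This is only an illustration of the desired behaviour, not of anything Urchin does; the function name split_url and the assumption that each parameter occurs at most once per URL are mine.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def split_url(url, page_params):
    """Split a URL into a canonical page URL and the leftover parameters.

    page_params lists the names of the parameters that identify a page,
    in the order in which they should appear in the report entry.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    values = dict(pairs)  # assumption: each parameter occurs at most once
    page_pairs = [(name, values[name]) for name in page_params if name in values]
    other_pairs = [(key, value) for key, value in pairs if key not in page_params]
    page = parts.path
    if page_pairs:
        page += "?" + urlencode(page_pairs)
    return page, other_pairs


first = split_url(
    "/addToCart.php?catalog=1&product=2&session=654372392",
    ["catalog", "product"],
)
second = split_url(
    "/addToCart.php?product=2&catalog=1&session=654372392",
    ["catalog", "product"],
)
print(first)
print(second)
```

Both calls yield the page entry /addToCart.php?catalog=1&product=2, with the session parameter reported separately.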

Urchin 5: Flexible Parameter Handling Using Advanced Filters

In the remainder of this article, I will show how a well-crafted chain of filters can be used to meet all of the above requirements with Urchin 5. In order to implement the steps outlined below, an Urchin account with administrative rights must be used. This solution will only be effective if your naming of parameters is more or less consistent. However, the order in which parameters occur in URLs is not significant.

Under Configuration – Urchin Profiles – Filter Manager, create the following filters:

The first filter adds a leading ampersand to the query such that we can safely assume that every parameter starts with an ampersand. Urchin is robust enough to ignore superfluous ampersands so adding sentinels like that is safe.

Name: Add & to the beginning of request_query
Filter Type: Advanced
Field A: request_query    Extract A: ^[^&]
Field B: request_query    Extract B: (.*)
Output To: request_query    Constructor: &$B1
Override Output Field: Yes
Required Fields: A Required Only
Case Sensitive: No
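
Why the sentinel helps can be seen in a small Python sketch. It mimics the filter with Python's re module, which is only an approximation of Urchin's regular expression engine; the function name and the sample parameters id and pid are made up for illustration.

```python
import re


def add_leading_ampersand(query):
    """Mimic the filter: prepend & unless the query is empty or already starts with &."""
    if re.search(r"^[^&]", query):
        return "&" + query
    return query


# A pattern of the kind used by the later filters, here for a parameter named "id".
pattern = re.compile(r"&id=[^&]+", re.IGNORECASE)

raw = "id=7&pid=12"
print(pattern.search(raw))                         # no match without the sentinel
print(pattern.search(add_leading_ampersand(raw)))  # matches "&id=7"
print(pattern.search("&pid=12&id=7"))              # matches "&id=7", not "&pid=12"
```

Requiring the ampersand in front of the name also keeps a parameter such as id from matching inside a longer name such as pid.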

The second filter adds a trailing question mark to every request stem. A request stem is the part of the URL before the query. Adding a sentinel helps us when we append parameters to the stem later on but we will have to clean this up eventually with another filter.

Name: Add ? to the end of request_stem
Filter Type: Advanced
Field A: request_stem    Extract A: [^?]$
Field B: request_stem    Extract B: (.*)
Output To: request_stem    Constructor: $B1?
Override Output Field: Yes
Required Fields: A Required Only
Case Sensitive: No

Add two filters for every parameter that belongs to the set of parameters identifying pages. One filter copies the parameter from the request query to the request stem. Replace <p> with the name of the actual parameter.

Name: Copy <p> to request_stem
Filter Type: Advanced
Field A: request_stem    Extract A: (.*)
Field B: request_query    Extract B: (&<p>=[^&]+)
Output To: request_stem    Constructor: $A1$B1
Override Output Field: Yes
Required Fields: Both Fields
Case Sensitive: No

The other filter removes the parameter from the request_query. Replace <p> with the name of the actual parameter. Ideally, the replacement string should be empty but Urchin doesn’t let us do that. Fortunately, Urchin ignores the superfluous ampersands left in the query by this filter.

Name: Remove <p> from request_query
Filter Type: Search & Replace
Filter Field: request_query
Search String: &<p>=[^&]+
Replace String: &
Case Sensitive: No

Repeat the above two filters for every parameter you would like to be part of the page URL in your Urchin reports. After that, we need two more filters to clean up the sentinels we added in the beginning. First, let’s remove the leading ampersand.

Name: Strip ?& from request_stem
Filter Type: Search & Replace
Filter Field: request_stem
Search String: \?&
Replace String: ?
Case Sensitive: No

Get rid of trailing question marks. Some of the question marks added earlier are now followed by query parameters. These should not be removed.

Name: Strip trailing ? from request_stem
Filter Type: Advanced
Field A: request_stem    Extract A: \?$
Field B: request_stem    Extract B: ^([^?]+)\?$
Output To: request_stem    Constructor: $B1
Override Output Field: Yes
Required Fields: Both Fields
Case Sensitive: No

Now, add the previously created filters to some or all of your log sources. Under Configuration – Urchin Profiles – Log Manager, click Edit to edit a log source and go to the Log Filters tab. Add the filters one by one in the following sequence:

Add & to the beginning of request_query
Add ? to the end of request_stem
Copy <p1> to request_stem
Remove <p1> from request_query
…
Copy <pN> to request_stem
Remove <pN> from request_query
Strip ?& from request_stem
Strip trailing ? from request_stem
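
To check the chain without reprocessing any logs, it can be simulated in Python. This is a rough sketch under several assumptions: Python's re module stands in for Urchin's regex engine, each parameter occurs at most once per query, and the function name apply_filter_chain is made up for illustration.

```python
import re


def apply_filter_chain(stem, query, page_params):
    """Simulate the filter chain on one (request_stem, request_query) pair."""
    flags = re.IGNORECASE  # all filters above are configured as case-insensitive

    # Add & to the beginning of request_query
    if re.search(r"^[^&]", query):
        query = "&" + query
    # Add ? to the end of request_stem
    if re.search(r"[^?]$", stem):
        stem = stem + "?"

    for name in page_params:
        pattern = "&" + re.escape(name) + "=[^&]+"
        # Copy <p> to request_stem (only if the parameter is present)
        match = re.search(pattern, query, flags)
        if match:
            stem += match.group(0)
        # Remove <p> from request_query, leaving a lone ampersand behind
        query = re.sub(pattern, "&", query, flags=flags)

    # Strip ?& from request_stem
    stem = stem.replace("?&", "?")
    # Strip trailing ? from request_stem
    stem = re.sub(r"^([^?]+)\?$", r"\1", stem)
    return stem, query


page_params = ["catalog", "product"]
result_a = apply_filter_chain(
    "/addToCart.php", "catalog=1&product=2&session=654372392", page_params
)
result_b = apply_filter_chain(
    "/addToCart.php", "product=2&catalog=1&session=654372392", page_params
)
print(result_a)
print(result_b)
print(apply_filter_chain("/index.html", "", page_params))
```

In this simulation both sample requests end up with the stem /addToCart.php?catalog=1&product=2 and a query that contains only the session parameter plus leftover ampersands, while a request without a query keeps its plain stem.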

The order in which you add the "Copy <p> to request_stem" filters to the log source will be the order in which the respective parameters appear in the page entry in your Urchin reports. For example, if you add them in the order

Copy <p1> to request_stem
Remove <p1> from request_query
Copy <p2> to request_stem
Remove <p2> from request_query

the report entry will look like

example.php?p1=…&p2=…

regardless of whether the requested URL was

example.php?p1=…&p2=…

or

example.php?p2=…&p1=…

Submitted by Anonymous on Wed, 07/11/2007 - 12:14.

Thanks for your reply, Hannes. That doesn't work either. I'm about to post to the Google Analytics Group to find out why none of my filters work.

Submitted by Hannes Schmidt on Wed, 07/04/2007 - 23:23.

Try an advanced filter. Use referral_domainandstem for field A and B and output field. Use the regex ^[^.]+\.(mail\.yahoo\.com/.*)$ for Extract A. Regex for B should be empty. Use $A1 as constructor. Required Fields should be set to A only and Override should be Yes.

-- Hannes

Submitted by Anonymous on Wed, 07/04/2007 - 13:28.

I'm trying to clean Referrers in Google Analytics (formerly Urchin) but I can't make them work. For example, I'm trying to group "us.f639.mail.yahoo.com" or "fr.f387.mail.yahoo.com" into one single "mail.yahoo.com".

Filter Type: Custom filter - Search and Replace
Filter Field: Referral
Search String: .*\.mail\.yahoo\.com.*
Replace String: mail\.yahoo\.com

But this doesn't seem to work, as no "mail.yahoo.com" shows up, and there are still silly Yahoo server names in my referrers.

Submitted by Hannes Schmidt on Sun, 11/12/2006 - 13:47.

Hi, I can't give you the solution to your exact problem but I have a few suggestions that might help you find one. It seems that you are on the right track but the semi-colon looks very suspicious to me. The standard URL syntax for a query would be /images/spacer.gif?jsessionid=F36E… so I wonder how the semi-colon made it into your URLs. Could the log format of your web server be incompatible with Urchin? It might also be that Urchin treats a semi-colon specially, who knows. In order to avoid potential conflicts, try matching against .jsessionid=[a-z0-9]+$ or \;jsessionid=[a-z0-9]+$. Also, I would leave Extract B empty and set Required to A only.

-- Hannes

Submitted by Anonymous on Fri, 11/10/2006 - 19:48.

I've tried it, but failed miserably.
Some urls I have are followed by ";jsessionid=[A-Z0-9]+$" that I want to get rid of.

I've tried search and replace. And finally the last attempt was an Advanced filter.

Field A : cs_uristem(RAW)      Extract A:   ;jsessionid=.*$
Field B : cs_uristem(RAW)      Extract B:   ^([^;]+);jsessionid=.*$
Output To: request_stem(AUTO)  Constructor: $B1
Overwrite: Yes
Required: Both
Case Sen : No

Now, I've tried to interchange cs_uristem and request_stem and exhausted all possible permutations. I've tried all possible regexes that came to mind

;jsessionid=[[:alnum:]]+$
;jsessionid=[A-Z0-9]+$
;jsessionid=.+$
;.*

etc, and nothing good came out of this. I've tried to use it as a log filter and as a profile filter and both. Deleted profiles and recreated them under different names (to force the parser to reread the log file). At this point I think I'm ready to make a little hole in my head.

Any suggestions on the drill bit size or maybe a solution to this filtering problem?

P.S. This is the exact cs_uristem:
/images/spacer.gif;jsessionid=F36E96E03D0C9F9619670C34C8EF9E80

Submitted by Hannes Schmidt on Thu, 01/26/2006 - 16:56.

Sorry, I have not yet used UTM.

Submitted by Anonymous on Thu, 01/26/2006 - 14:37.

This is good. Urchin's filters are advanced and I wish there was more info about them.

I have a question: do you have a filter example that would track the click paths using the utm_id code? I would like to know, based on the utm_id, where the user went in the site afterwards.

Thanks in advance for the help!

Submitted by Hannes Schmidt on Thu, 12/15/2005 - 13:14.

Thank you, Enayet. I have not seen that article before. It's very similar to my technique. I wonder why it doesn't work for you. Maybe it's because the first filter uses request_uri whereas my solution uses request_query.

You are right, trial-and-error with Urchin filters can be very tedious. What I did was move all but the most recent logfile out of the log file directory, delete Urchin's entire database and then run the profile with just this one log file in place. Once I got it working, I moved all the logfiles back into place and ran the profile again. But this was only possible because I had a complete set of logfiles. Most web servers rotate logfiles after a while so it might be that Urchin's database contains records from log files that are long gone. In that case you have to be very careful about making a backup before deleting your entire Urchin database.

Submitted by Anonymous (not verified) on Thu, 12/15/2005 - 11:08.

I enjoyed reading your article today. How did you "play" around with the regular expressions? It seems to me that it is quite tedious to change a filter and reprocess the logs every time to see how it works.

One more question to you...I am trying to duplicate this technote but I am getting no results in the report. Have you done this before?

-Enayet

Submitted by Anonymous on Thu, 12/15/2005 - 05:17.

Thanks, Hannes. I will give that a try. Again - thanks for the resource.

Luke

Submitted by Hannes Schmidt on Wed, 12/14/2005 - 22:20.

Hi Luke,

I think an Include Pattern ONLY filter with Filter Field set to Request stem (AUTO) and a Filter Pattern of ^/billing/ might do the trick. I haven't tested it out but it should point you in the right direction. I think the filter can either go into the Log Filter tab of a log source or into the Profile Filter tab of a profile.

-- Hannes

Submitted by Anonymous on Wed, 12/14/2005 - 14:38.

This is a great article - especially since documentation on filters is sparse.

I am trying to create a filter in Urchin 5 that just pulls hits to files and folders beneath the /billing directory. So, I want to ignore any other data in the logfile, so that it only reports on records like /billing?index.php and not /images/data.gif.

Any suggestions?

Luke