Urchin 5 and dynamic URLs with query parameters/arguments
Urchin is a professional web log analysis and statistics application. It was recently acquired by Google and what used to be called Urchin 6 On Demand is now being integrated into Google Analytics. I don’t know for how long the stand-alone Urchin 5 will be around but right now it is still used by many individuals and corporations. I was not happy with the way Urchin deals with dynamic URLs, i.e. URLs that have query parameters in them. After playing around with Urchin 5's advanced filters for a while, I came to the conclusion that they can be employed to improve Urchin's dynamic URL handling.
Nice Try: The DynamicURL Filter
Urchin 4 simply strips query parameters from every URL. As this is obviously not what many people want, an advanced filter called DynamicURL can be used to match certain query parameters and map them to static-looking URLs. For example,
/catalog.php?prodID=12
can be mapped to
/catalog.php/prodID/12
by using a DynamicURL filter with the following regular expression
(/catalog\.php\?)(prodID=.*)
The DynamicURL filter compares every URL against a regular expression. The regular expression should contain at least one pair of parentheses. If it matches, the URL will be replaced by another URL that consists of the text matched by each pair of parentheses with a slash in between.
The DynamicURL filter isn’t a very elegant solution. It creates artificial URLs that cannot be clicked in the statistics page because they do not point to the original page. More importantly, it imposes a certain order on the query parameters. Unless that order can be guaranteed, one filter is needed for every possible permutation of parameters. This can easily get out of hand with more than two parameters. A URL with three parameters like
/script.php?a=1&b=2&c=3
for example, needs six filters:
(/script\.php\?)(a=.*)&(b=.*)&(c=.*)
(/script\.php\?)(a=.*)&(c=.*)&(b=.*)
(/script\.php\?)(b=.*)&(c=.*)&(a=.*)
(/script\.php\?)(b=.*)&(a=.*)&(c=.*)
(/script\.php\?)(c=.*)&(a=.*)&(b=.*)
(/script\.php\?)(c=.*)&(b=.*)&(a=.*)
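To get a feeling for how quickly the number of filters grows, the following Python sketch generates one DynamicURL-style pattern per ordering of the parameters. It is not part of Urchin, and the function name dynamic_url_patterns is made up for this illustration.

```python
from itertools import permutations

def dynamic_url_patterns(stem, params):
    """Build one DynamicURL-style regex per ordering of `params`.

    Hypothetical helper for illustration only; Urchin has no such function.
    """
    # Escape the dot in the script name and capture the stem plus "?"
    prefix = "(" + stem.replace(".", r"\.") + r"\?)"
    return [
        prefix + "&".join(f"({name}=.*)" for name in order)
        for order in permutations(params)
    ]

patterns = dynamic_url_patterns("/script.php", ["a", "b", "c"])
for pattern in patterns:
    print(pattern)
print(len(patterns))  # 6, i.e. 3!
```

With four parameters the same call already yields 24 patterns, and with five it yields 120.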
Using a single, more generic filter as in
(/script\.php\?)([a-z]+=.*)&([a-z]+=.*)&([a-z]+=.*)
only works if there are no additional parameters that should not be mapped. Either way, every permutation will generate a distinct report entry. For example,
/script.php?a=1&b=2&c=3
maps to
/script.php/a=1/b=2/c=3
whereas the semantically equivalent
/script.php?b=2&a=1&c=3
maps to
/script.php/b=2/a=1/c=3
Two Different Kinds of Parameters
With version 5, Urchin goes to the other extreme by stripping all URL parameters and listing them in a separate report (Page Query Terms). In some situations, this might be the right thing to do. In other situations, especially with content management systems, it will most likely be undesirable. The key concept that Urchin fails to implement is that URL parameters fall into two sets. The first set of parameters identifies pages. The second set determines slight alterations to the content of these pages. The URL
/addToCart.php?catalog=1&product=2&session=654372392
for example, has three parameters. Two of those define a page: product and catalog. The session parameter holds the session id and is used to identify one particular browser session. The URL
/addToCart.php?product=2&catalog=1&session=654372392
points to the very same page. Consequently, both URLs should be listed in the statistics under the single entry
/addToCart.php?catalog=1&product=2
A web log analysis application should
- allow the user to specify which parameters define pages. These parameters should be listed as part of the URL in the report; the others should be broken down into a separate report.
- be insensitive to the order in which parameters occur in URLs.
- produce reports in which entries can be clicked in order to get to the respective page.
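The three requirements above can be pinned down outside of Urchin. The following Python sketch only illustrates the desired behaviour; it is not how Urchin works internally. The names canonical_page and page_params are hypothetical, and the sketch assumes that each parameter occurs at most once per URL.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def canonical_page(url, page_params):
    """Split a URL into (page entry, remaining query parameters).

    `page_params` is an ordered list of the parameter names that identify
    a page; its order fixes the order of parameters in the page entry.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    values = dict(pairs)  # assumption: each parameter occurs at most once
    page_query = [(name, values[name]) for name in page_params if name in values]
    others = [(key, value) for key, value in pairs if key not in page_params]
    entry = parts.path + ("?" + urlencode(page_query) if page_query else "")
    return entry, others

first = canonical_page(
    "/addToCart.php?catalog=1&product=2&session=654372392", ["catalog", "product"])
second = canonical_page(
    "/addToCart.php?product=2&catalog=1&session=654372392", ["catalog", "product"])
print(first[0])         # /addToCart.php?catalog=1&product=2
print(first == second)  # True
```

Both orderings collapse into the same clickable entry, and the session parameter ends up in the second element of the result, where it could feed a separate report.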
Urchin 5: Flexible Parameter Handling Using Advanced Filters
In the remainder of this article, I will show how a well-crafted chain of filters can be used to accomplish all of the above requirements with Urchin 5. In order to implement the steps outlined below, an Urchin account with administrative rights must be used. This solution will only be effective if your naming of parameters is more or less consistent. However, the order in which parameters occur in URLs is not significant.
Under Configuration > Urchin Profiles > Filter Manager, create the following filters:
The first filter adds a leading ampersand to the query such that we can safely assume that every parameter starts with an ampersand. Urchin is robust enough to ignore superfluous ampersands so adding sentinels like that is safe.
Name | Add & to the beginning of request_query
Filter Type | Advanced
Field A | request_query
Extract A | ^[^&]
Field B | request_query
Extract B | (.*)
Output To | request_query
Constructor | &$B1
Override Output Field | Yes
Required Fields | A Required Only
Case Sensitive | No
The second filter adds a trailing question mark to every request stem. A request stem is the part of the URL before the query. Adding a sentinel helps us when we append parameters to the stem later on but we will have to clean this up eventually with another filter.
Name | Add ? to the end of request_stem
Filter Type | Advanced
Field A | request_stem
Extract A | [^?]$
Field B | request_stem
Extract B | (.*)
Output To | request_stem
Constructor | $B1?
Override Output Field | Yes
Required Fields | A Required Only
Case Sensitive | No
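To see what the two sentinel filters do to a request, here is a small Python sketch that mimics them with the re module. It rests on two assumptions of mine: that Urchin passes the stem and the query to the filters without the separating question mark, and that "A Required Only" means the filter is applied only when Extract A matches. The function name add_sentinels is made up.

```python
import re

def add_sentinels(stem, query):
    """Mimic the two sentinel filters (a sketch, not Urchin's implementation)."""
    # Filter 1: Extract A ^[^&] matches a non-empty query that does not
    # already start with "&"; the constructor &$B1 then prefixes one.
    if re.search(r"^[^&]", query):
        query = "&" + query
    # Filter 2: Extract A [^?]$ matches a stem that does not already end
    # with "?"; the constructor $B1? then appends one.
    if re.search(r"[^?]$", stem):
        stem = stem + "?"
    return stem, query

print(add_sentinels("/script.php", "a=1&b=2"))  # ('/script.php?', '&a=1&b=2')
print(add_sentinels("/index.html", ""))         # ('/index.html?', '')
```

Note that an empty query stays empty because ^[^&] needs at least one character to match.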
Add two filters for every parameter that belongs to the set of parameters identifying pages. One filter copies the parameter from the request query to the request stem. Replace <p> with the name of the actual parameter.
Name | Copy <p> to request_stem
Filter Type | Advanced
Field A | request_stem
Extract A | (.*)
Field B | request_query
Extract B | (&<p>=[^&]+)
Output To | request_stem
Constructor | $A1$B1
Override Output Field | Yes
Required Fields | Both Fields
Case Sensitive | No
The other filter removes the parameter from the request_query. Replace <p> with the name of the actual parameter. Ideally, the replacement string should be empty but Urchin doesn’t let us do that. Fortunately, Urchin ignores the superfluous ampersands left in the query by this filter.
Name | Remove <p> from request_query
Filter Type | Search & Replace
Filter Field | request_query
Search String | &<p>=[^&]+
Replace String | &
Case Sensitive | No
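The effect of one Copy/Remove pair can be traced with the following Python sketch. Again, this only imitates the filters with the re module. The function names copy_param and remove_param are hypothetical, and I assume that each parameter occurs at most once in the query, since I have not checked whether Search & Replace touches every occurrence or only the first.

```python
import re

def copy_param(stem, query, name):
    """Imitate the Advanced filter "Copy <p> to request_stem"."""
    # Extract B (&<p>=[^&]+), constructor $A1$B1, not case sensitive.
    found = re.search("(&" + re.escape(name) + "=[^&]+)", query, re.IGNORECASE)
    return stem + found.group(1) if found else stem

def remove_param(query, name):
    """Imitate the Search & Replace filter "Remove <p> from request_query"."""
    # The replacement is "&" because the replace string cannot be empty.
    return re.sub("&" + re.escape(name) + "=[^&]+", "&", query, flags=re.IGNORECASE)

# State after the two sentinel filters have run
stem, query = "/addToCart.php?", "&product=2&catalog=1&session=654372392"
for name in ["catalog", "product"]:
    stem = copy_param(stem, query, name)
    query = remove_param(query, name)

print(stem)   # /addToCart.php?&catalog=1&product=2
print(query)  # &&&session=654372392
```

The stem now carries the page parameters in the order of the filters, not in the order of the request, and the query is left with a run of superfluous ampersands.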
Repeat the above two filters for every parameter you would like to be part of the page URL in your Urchin reports. After that, we need two more filters to clean up the sentinels we added in the beginning. First, let’s remove the leading ampersand, which now directly follows the question mark in the request stem.
Name | Strip ?& from request_stem
Filter Type | Search & Replace
Filter Field | request_stem
Search String | \?&
Replace String | ?
Case Sensitive | No
Get rid of trailing question marks. Some of the question marks added earlier are now followed by query parameters. These should not be removed.
Name | Strip trailing ? from request_stem
Filter Type | Advanced
Field A | request_stem
Extract A | \?$
Field B | request_stem
Extract B | ^([^?]+)\?$
Output To | request_stem
Constructor | $B1
Override Output Field | Yes
Required Fields | Both Fields
Case Sensitive | No
Now, add the previously created filters to some or all of your log sources. Under Configuration > Urchin Profiles > Log Manager, click Edit to edit a log source and go to the Log Filters tab. Add the filters one by one in the following sequence:
Add & to the beginning of request_query
Add ? to the end of request_stem
Copy <p1> to request_stem
Remove <p1> from request_query
…
Copy <pN> to request_stem
Remove <pN> from request_query
Strip ?& from request_stem
Strip trailing ? from request_stem
The order in which you add the "Copy <p> to request_stem" filters to the log source will be the order in which the respective parameters appear in the page entry in your Urchin reports. For example, if you add them in the order
Copy <p1> to request_stem
Remove <p1> from request_query
Copy <p2> to request_stem
Remove <p2> from request_query
the report entry will look like
example.php?p1=…p2=…
regardless of whether the requested URL was
example.php?p1=…p2=…
or
example.php?p2=…p1=…
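For readers who want to check the whole chain before configuring it, here is an end-to-end Python sketch. It imitates the filters in the sequence given above; it is not Urchin code, the function name urchin_page_entry is made up, and it assumes that Urchin splits a request at the first question mark and that each parameter occurs at most once.

```python
import re

def urchin_page_entry(url, page_params):
    """Imitate the full filter chain; returns (request_stem, request_query)."""
    stem, _, query = url.partition("?")  # assumed split into stem and query
    # Sentinels: leading "&" on a non-empty query, trailing "?" on the stem
    if query and not query.startswith("&"):
        query = "&" + query
    if not stem.endswith("?"):
        stem += "?"
    # One Copy/Remove pair per page parameter, in the configured order
    for name in page_params:
        pattern = re.compile("&" + re.escape(name) + "=[^&]+", re.IGNORECASE)
        found = pattern.search(query)
        if found:
            stem += found.group(0)
        query = pattern.sub("&", query)
    # Cleanup: "?&" becomes "?", and a "?" left at the very end is dropped
    stem = stem.replace("?&", "?")
    stem = re.sub(r"^([^?]+)\?$", r"\1", stem)
    return stem, query

page_params = ["catalog", "product"]
entry_1, _ = urchin_page_entry(
    "/addToCart.php?catalog=1&product=2&session=654372392", page_params)
entry_2, _ = urchin_page_entry(
    "/addToCart.php?product=2&catalog=1&session=654372392", page_params)
print(entry_1)             # /addToCart.php?catalog=1&product=2
print(entry_1 == entry_2)  # True
print(urchin_page_entry("/index.html", page_params))  # ('/index.html', '')
```

Both orderings of the addToCart.php request end up under the same entry, and a request without any query keeps its plain stem because the trailing question mark is stripped again.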