Ryan

Reputation: 24482

Website analysis tool - how to determine unique pages from set of urls

How would a web analytics package such as Piwik, Google Analytics, or Omniture determine which pages are unique from a set of URLs?

E.g. a) a site could have the following pages for a product catalogue

or b) use query string

In either case you can have extra query string variables for things like affiliate links or other uses, so how could you determine that it's the same page?

e.g. both of these are for the foo product pages listed above.

If you ignore all the query string then all products in catalogue.xxx are collated into one page view.

If you don't ignore the query string then any extra query string params look like different pages.

If you're dealing with 3rd party sites then you can't assume that they use either method, or rely on something like canonical links being correct.

How could you tackle this?

Upvotes: 0

Views: 263

Answers (2)

CrayonViolent

Reputation: 32537

Different tracking tools handle it differently, but you can explicitly set the reporting URL for all the tools.

For instance, Omniture doesn't care about the query string and chops it off. Even if you don't specify a pageName and the pages report defaults to the URL, the query string is still removed.

GA will record the full url including query string every time.

Yahoo Web Analytics only records the query string on the first page of the visit and removes it on every page afterwards.

But as mentioned, all of the tools have a way to explicitly specify the URL to be reported, and it is easy to write a bit of JavaScript to remove the query string from the URL and pass the result as the URL to report.

You mentioned giving your tracking code to 3rd parties. Since you are already giving them tracking code, it's easy enough to include that extra bit of JavaScript in it.

For example, with GA (async version), instead of

_gaq.push(['_trackPageview']);

you would do something like

var page = location.href.split('?');
_gaq.push(['_trackPageview',page[0]]);
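
Dropping the whole query string would also merge pages that are told apart by a query parameter, which is the catalogue case raised in the question. A possible variation, as a rough sketch only: strip just the parameters you know are noise and keep the rest. The function name `reportingUrl`, the parameter names `affiliate` and `product`, and the example.com URLs are all assumptions made for illustration; it relies on the standard `URL` API being available.

```javascript
// Hypothetical helper (not part of any analytics library): remove only the
// query parameters that do not identify a page and keep everything else.
function reportingUrl(href, ignoredParams) {
  var url = new URL(href);
  ignoredParams.forEach(function (name) {
    // Deleting a parameter that is not present is a no-op.
    url.searchParams.delete(name);
  });
  // Return path plus remaining query string; the fragment is left out.
  return url.pathname + url.search;
}

// Assumed example URLs: 'affiliate' is treated as noise, 'product' is kept.
var withProduct = reportingUrl(
  'http://example.com/catalogue.php?product=foo&affiliate=123',
  ['affiliate']
);
var pathOnly = reportingUrl(
  'http://example.com/catalogue/foo?affiliate=123',
  ['affiliate']
);
```

In the page itself the result would be passed the same way as above, for example _gaq.push(['_trackPageview', reportingUrl(location.href, ['affiliate'])]). The list of ignored parameters still has to be maintained per site, so this only helps when you know which parameters are noise.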

edit:

Or, for GA, you can exclude query parameters within the reporting tool itself. Other tools may or may not offer this, so the code example can be applied to any of them (populating their specific URL variable, obviously).

Upvotes: 1

Kevin Lacquement

Reputation: 5117

If you're dealing with third-party sites, you can't assume that their URLs follow any specific format either. You can try downloading the pages and comparing them locally, but even that is unreliable because of things like rotating advertisements and timestamps.

If you are dealing with a single site (or a small group of them), you can make a pattern to match each URL to a canonical (for you) form. However, this will get unmanageable quickly.
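
As a rough illustration of what such per-site patterns might look like, here is a minimal sketch. The rule list, the `canonicalKey` name, the `/catalogue/...` and `/catalogue.php?product=...` URL shapes, and the `product:` key format are all assumptions invented for this example, not taken from any real site or tool.

```javascript
// Hypothetical per-site rules: each rule receives a parsed URL and returns
// a canonical key, or null if the rule does not apply.
var exampleRules = [
  // Path style (assumed): /catalogue/foo or /catalogue/foo/
  function (url) {
    var m = /^\/catalogue\/([^\/]+)\/?$/.exec(url.pathname);
    return m ? 'product:' + m[1] : null;
  },
  // Query string style (assumed): /catalogue.php?product=foo
  function (url) {
    var id = url.searchParams.get('product');
    return (url.pathname === '/catalogue.php' && id) ? 'product:' + id : null;
  }
];

// Try each rule in order; fall back to the path without the query string.
function canonicalKey(href, rules) {
  var url = new URL(href);
  for (var i = 0; i < rules.length; i++) {
    var key = rules[i](url);
    if (key !== null) {
      return key;
    }
  }
  return url.pathname;
}
```

With these assumed rules, both http://example.com/catalogue/foo?affiliate=9 and http://example.com/catalogue.php?product=foo&affiliate=9 map to the same key, while unmatched URLs collapse to their path. Every new site layout needs its own rules, which is where the approach stops scaling.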

Of course, this is the reason that search engines like Google recommend the use of rel='canonical' links in the page header. If even Google has trouble telling the pages apart, it's not a trivial problem.

Upvotes: 1
