Massive differences between Google Analytics and own data collection

Question

I want to evaluate the usage of a web app statistically. It has been publicly available since spring of this year.

The web app is linked to Google Analytics. In addition, I collect my own user data as follows:

A unique user ID (UUID) is created the first time the web app is opened. It is stored in localStorage and checked on every subsequent page load.

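// Create and store a UUID on the first visit only.
if (localStorage.getItem("uuid") === null) {
  localStorage.setItem("uuid", get_uuid());
}

// Build a random version 4 UUID from the template
// "10000000-1000-4000-8000-100000000000".
function get_uuid() {
  return ([1e7]+-1e3+-4e3+-8e3+-1e11).replace(/[018]/g, c =>
    (c ^ crypto.getRandomValues(new Uint8Array(1))[0] & 15 >> c / 4).toString(16)
  );
}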

This ID is written to a database together with other information (the specific page, time, device type, etc.). Users without JavaScript or localStorage are not included; however, they will probably not be able to use the web app properly anyway.
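The get-or-create step can also be written so that a storage failure is reported instead of throwing. This is only a sketch: `getOrCreateUuid` is a hypothetical helper, and the idea that blocked storage explains some visits without a UUID is an assumption, not something the data confirms.

```javascript
// Hypothetical helper: returns the stored ID, creating one on first call.
// `storage` is anything with getItem/setItem (for example window.localStorage);
// `generate` is the ID generator (for example the get_uuid function above).
function getOrCreateUuid(storage, generate) {
  try {
    let id = storage.getItem("uuid");
    if (id === null) {
      id = generate();
      storage.setItem("uuid", id);
    }
    return id;
  } catch (err) {
    // Storage is blocked or full; the caller can log this visit as "no UUID".
    return null;
  }
}
```

In the browser this would be called as `getOrCreateUuid(localStorage, get_uuid)`; a `null` result marks a visit that could not be assigned an ID.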

When I compare the data from Google Analytics with my own, the discrepancy is considerable.

Distinct users according to Google: about 900
Distinct users according to UUID: about 400

In addition, about 100 visits (or interactions) were registered without a UUID.
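For reference, the two figures from my own data could be derived roughly as follows. This is a sketch that assumes each database row has been loaded as an object with a `uuid` field (a hypothetical shape; `summarizeVisits` is an illustrative name).

```javascript
// Hypothetical record shape: { uuid: string | null, page: string }.
function summarizeVisits(rows) {
  const distinct = new Set();
  let withoutUuid = 0;
  for (const row of rows) {
    if (row.uuid) {
      distinct.add(row.uuid); // a Set keeps each UUID only once
    } else {
      withoutUuid += 1; // visit recorded without an ID
    }
  }
  return { distinctUsers: distinct.size, visitsWithoutUuid: withoutUuid };
}
```

Note that this counts browsers rather than people: the same person on two devices, or after clearing localStorage, appears as two UUIDs.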

My question is why the difference is so large. In my opinion, my own data collection should be fairly accurate. Is there a flaw in my UUID approach? Or does Google count quite differently, for example by including robots that do not leave a UUID behind?

Thank you very much for your answers and considerations.

Andreas · Accepted Answer

I'm quite sure you have encountered Google Analytics (GA) spam.
This is possible because GA runs as client-side JavaScript, so your tracking ID is visible in the HTML source.

Anyone who wants to inject spam into your data can use your ID.
Why, you ask? The spammers hope that you (the admin) notice unfamiliar webpages listed in your GA data, open them, and get a virus or worse.
Don't open those webpages.

As far as I know, there are two ways to fix it. The common way is a regex filter:
you block every referral that comes from a domain you don't "know".
This takes time and is not a good approach.
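To illustrate the regex approach, the pattern for such a filter can be generated from a list of unwanted domains rather than typed by hand. This is a minimal sketch; the domain names are made up, and `escapeRegex` and `buildReferrerPattern` are hypothetical helper names.

```javascript
// Escape regex metacharacters so "." in a domain matches a literal dot.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build one alternation pattern from a list of unwanted referrer domains.
function buildReferrerPattern(domains) {
  return domains.map(escapeRegex).join("|");
}

// Hypothetical spam domains, for illustration only.
const pattern = buildReferrerPattern(["spam-one.example", "free-traffic.example"]);
const isSpam = new RegExp(pattern, "i");
```

The list still has to be maintained by hand as new spam domains appear, which is why this approach takes time.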

My method is to pass a custom dimension from the HTML page to GA.
If that dimension is missing, the hit did not come from your page.

Your JavaScript probably looks something like:

.....
  ga('require', 'linkid', 'linkid.js');
  ga('require', 'displayfeatures');
  ga('send', 'pageview');

We add a dimension, which we then register in the GA admin tools:

.....
  ga('require', 'linkid', 'linkid.js');
  ga('require', 'displayfeatures');
  ga('send', 'pageview', {
    'dimension1': 'FooBar'
  });

Go to Admin -> Property (the middle column); at the bottom you will find Custom Definitions. Open Custom Dimensions and add the dimension you added to the HTML.

Now you can set up a filter in the View column of GA Admin that only includes data where your custom dimension is "FooBar".

Any data that does not carry this "FooBar" value is spam that was not generated by your webpage.
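The filter itself is configured in GA, but its logic can be illustrated on a list of hits. This sketch assumes hits are available as plain objects with an optional `dimension1` field (a hypothetical shape, for example from an export); `splitHits` is an illustrative name.

```javascript
// Hypothetical hit shape: { page: string, dimension1?: string }.
const EXPECTED_MARKER = "FooBar"; // the value sent with the pageview

// Separate hits that carry the marker from those that do not.
function splitHits(hits) {
  const genuine = [];
  const suspect = [];
  for (const hit of hits) {
    (hit.dimension1 === EXPECTED_MARKER ? genuine : suspect).push(hit);
  }
  return { genuine, suspect };
}
```

Keep in mind that the marker is also visible in the page source, so this only catches spam that is sent without ever loading your page.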

Just remember that you need to update every GA JavaScript snippet on your site and add the dimension.

You can see this spam (if I'm correct) in the Acquisition -> All Traffic -> Referrals report.
If you see sources that you don't recognize and that look odd, they are most likely spam.
Before I used this method, my Referrals report contained about 50 of these fake referrals.
