user3142655
user3142655

Reputation: 109

Bigquery unnest hits - duplicating values)

Im trying to create a master view of a group in properties that are been imported into big query but it seem by using the unnest(hits) the SQL is duplicating the data leading to inaccurate values for revenues etc...

I have try to look at understanding why the unnest has caused this but I can't figure it out.

SELECT Date, hostname, channelGrouping, sum(transactionRevenue) as Revenue, sum(Shipping) as Shipping, sum(visits) as Sessions, sum(bounces) as Bounces, sum(transactions) as Transactions
    FROM
        (SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
        FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
        UNION ALL
        SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
        FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
        UNION ALL
        SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
        FROM `102674002.ga_sessions_*`, UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
        UNION ALL
        SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
        FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509'
        UNION ALL
        SELECT Date, h.page.hostname as hostname, channelGrouping, totals.transactionRevenue, totals.visits, h.transaction.transactionShipping as shipping, totals.bounces, totals.transactions
        FROM `xxxxxxxxx.ga_sessions_*`, UNNEST(hits) AS h
        WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170509')
    Group By Date, hostname, channelGrouping
    Order by Date

Upvotes: 1

Views: 5954

Answers (1)

Willian Fuks
Willian Fuks

Reputation: 11787

This might do the trick:

SELECT
  date,
  channelGrouping,
  SUM(Revenue) Revenue,
  SUM(Shipping) Shipping,
  SUM(bounces) bounces,
  SUM(transactions) transactions,
  hostname,
  COUNT(date) sessions
FROM(
  SELECT 
    date,
    channelGrouping,
    totals.totaltransactionrevenue / 1e6 Revenue,
    ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL)) hostnames,
    (SELECT SUM(hits.transaction.transactionshipping) / 1e6 FROM UNNEST(hits) hits) Shipping,
    totals.bounces bounces,
    totals.transactions transactions
  FROM `project_id.dataset_id.ga_sessions_*`
  WHERE 1 = 1
  AND ARRAY_LENGTH(ARRAY((SELECT DISTINCT page.hostname FROM UNNEST(hits) hits WHERE page.hostname IS NOT NULL))) > 0
  AND _TABLE_SUFFIX BETWEEN '20170601' AND '20170609'

  UNION ALL
  (...)

),
UNNEST(hostnames) hostname
GROUP BY
  date, channelGrouping, hostname

Notice that in this query I avoided applying the UNNEST operation in the hits field and I do so only inside subselects.

In order to understand why this is the case you have to understand how ga data is aggregated into BigQuery. Notice that we basically have 2 types of data: the session level data and the hits level. Each client visiting your website ends up generating a row into BigQuery, like so:

{fullvisitorid: 1, visitid:1, date: '20170601', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}, totals: {totalTransactionRevenue:0, bounces: 0}]

If the same customer comes back a day later it generates another row into BQ, something like:

{fullvisitorid: 1, visitid:2, date: '20170602', channelGrouping: "search", hits: [{hitNumber: 1, page: {hostname: "yourserverhostname"}}, {hitNumber: 2, page: {hostname: "yourserverhostname"}}, totals: {totalTransactionRevenue:50000000, bounces: 2}]

As you can see, fields outside the key hits are related to the session level (and therefore each hit, i.e, each interaction the customer has in your website, adds up another entry here). When you apply UNNEST, you basically, apply a cross-join with all values inside of the array to the outer fields.

And this is where duplication happens!

Given the past example, if we apply UNNEST to the hits field, you end up with something like:

fullvisitorid    visitid    totals.totalTransactionRevenue    hits.hitNumber
1                1          0                                 1
1                1          0                                 2
1                2          50000000                          1
1                2          50000000                          2

Notice that for each hit inside the hits field causes the outer fields, such as totals.totalTransactionRevenue to be duplicated for each hitNumber that happened inside the hits ARRAY.

So, if later on, you apply some operation like SUM(totals.totalTransactionRevenue) you end up summing this field multiplied by each hit that the customer had in that visitid.

What I tend to do is to avoid the (costly depending on the data volume) UNNEST operation on the hits field and I do so only in subqueries (where the unnesting happens only at the row level which does not duplicate data).

Upvotes: 4

Related Questions