knoll
knoll

Reputation: 295

Unnesting customDimensions w/ BigQuery

I've got a Standard SQL query written to pull some Google Analytics data out of BigQuery. I've pasted a simplified version below that I use for testing:

SELECT COUNT(DISTINCT(session_id))
FROM (
  SELECT
    CONCAT(CAST(fullVisitorId AS STRING), '.', CAST(visitStartTime AS STRING)) AS session_id
  FROM `myproject.ga_sessions_20180227`, UNNEST(hits) AS hits
);

Running this query outputs a unique session count of 6,696, which is accurate when compared to the GA UI.

However, when I introduce customDimensions into the query, I lose a number of rows. Running this query only outputs 6,606 unique sessions.

SELECT COUNT(DISTINCT(session_id))
FROM (
  SELECT
    CONCAT(CAST(fullVisitorId AS STRING), '.', CAST(visitStartTime AS STRING)) AS session_id,
    MAX(IF(customs.index = 14, customs.value, NULL)) AS custom_id
  FROM `myproject.ga_sessions_20180227`, UNNEST(customDimensions) AS customs
  GROUP BY fullVisitorId, visitStartTime
);

I'm not using hits.customDimensions, because the custom dimension in question is based on the user level, not the hit level. If I don't group the output, I get an error.

How can I unnest a non-hit based custom dimension without losing rows?

Upvotes: 1

Views: 4098

Answers (1)

Martin Weitzmann
Martin Weitzmann

Reputation: 4746

The problem is the CROSS JOIN (the comma): when cross joining a table with NULL - or in this case an empty customDimensions - the result is NULL in total instead of only the left table. You want to use LEFT JOIN instead of CROSS JOIN because this preserves the left table:

FROM `table` LEFT JOIN customDimensions AS cd

Besides that you can also use a simple sub-select instead of joining sub-tables/arrays:

SELECT
  (SELECT value FROM t.customDimensions WHERE index=1) AS cd1
  , COUNT(1) AS Qty
FROM `project.dataset.ga_sessions_20180202` AS t
GROUP BY 1

Upvotes: 2

Related Questions