Reputation: 295
I've got a Standard SQL query written to pull some Google Analytics data out of BigQuery. I've pasted a simplified version below that I use for testing:
SELECT COUNT(DISTINCT(session_id))
FROM (
SELECT
CONCAT(CAST(fullVisitorId AS STRING), '.', CAST(visitStartTime AS STRING)) AS session_id
FROM `myproject.ga_sessions_20180227`, UNNEST(hits) AS hits
);
Running this query outputs a unique session count of 6,696, which is accurate when compared to the GA UI.
However, when I introduce customDimensions into the query, I lose a number of rows. Running this query only outputs 6,606 unique sessions.
SELECT COUNT(DISTINCT(session_id))
FROM (
SELECT
CONCAT(CAST(fullVisitorId AS STRING), '.', CAST(visitStartTime AS STRING)) AS session_id,
MAX(IF(customs.index = 14, customs.value, NULL)) AS custom_id
FROM `myproject.ga_sessions_20180227`, UNNEST(customDimensions) AS customs
GROUP BY fullVisitorId, visitStartTime
);
I'm not using hits.customDimensions, because the custom dimension in question is based on the user level, not the hit level. If I don't group the output, I get an error.
How can I unnest a non-hit based custom dimension without losing rows?
Upvotes: 1
Views: 4098
Reputation: 4746
The problem is the CROSS JOIN
(the comma): when cross joining a table with NULL
- or in this case an empty customDimensions - the result is NULL
in total instead of only the left table.
You want to use LEFT JOIN
instead of CROSS JOIN
because this preserves the left table:
FROM `table` LEFT JOIN customDimensions AS cd
Besides that you can also use a simple sub-select instead of joining sub-tables/arrays:
SELECT
(SELECT value FROM t.customDimensions WHERE index=1) AS cd1
, COUNT(1) AS Qty
FROM `project.dataset.ga_sessions_20180202` AS t
GROUP BY 1
Upvotes: 2