Reputation: 77
Using BigQuery, I want to group pages depending on their title with one query and calculate different metrics on the groups. As the rules on titles are not mutually exclusive, I've done it this way:
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN
UNNEST([
CASE WHEN (title LIKE '%game%')
THEN 'games_group' END,
CASE WHEN (title LIKE '%sport%')
THEN 'sports_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
GROUP BY title_group
Here is the result:
views ... title_group
3414469869 ...
4355264 ... games_group
1361074 ... sports_group
However, the number 3414469869 for the views of the pages that don't belong to any group is wrong. Indeed, when a title doesn't contain "game" (or "sport"), we get UNNEST([null, "sports_group"]
) (or UNNEST(["games_group", null])
) so we still count the views for the null group. When a title doesn't contain "game" neither "sport", the views are even counted twice.
Is there a way to remove duplicates from the array ?
Upvotes: 1
Views: 1055
Reputation: 1269773
How about adding another group?
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` CROSS JOIN
UNNEST([CASE WHEN title LIKE '%game%' THEN 'games_group' END,
CASE WHEN title LIKE '%sport%' THEN 'sports_group' END,
CASE WHEN title NOT LIKE '%game%' AND title NOT LIKE '%sport%' THEN 'Neither' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
wiki = 'en' AND
title_group IS NOT NULL
GROUP BY title_group;
Note: This doesn't take NULL
titles into account. I don't know if that is important.
However, I would phrase this using two columns:
SELECT (title LIKE '%game%') as is_game,
(title LIKE '%sport%') as is_sport,
SUM(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
wiki = 'en' AND
title_group IS NOT NULL
GROUP BY is_game, is_sport;
This doesn't return the same rows as yours -- games and sports are split across two rows. But you can see the combinations.
EDIT:
Now that I think of this, you just want a LEFT JOIN
:
SELECT g.title_group, SUM(pv.views) as views,
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` pv LEFT JOIN
(SELECT '%game%' as pattern, 'games_group' as title_group UNION ALL
SELECT '%sport%', 'sports_group' as title_group UNION ALL
) g
ON pv.title LIKE g.pattern
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
wiki = 'en' AND
GROUP BY g.title_group;
Upvotes: 2
Reputation: 172993
Below is for BigQuery Standard SQL
#standardSQL
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`,
UNNEST(
CASE WHEN REGEXP_CONTAINS(title, r'game|sport') THEN
[
CASE WHEN (title LIKE '%game%') THEN 'games_group' END,
CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END
]
ELSE ['other']
END
) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
AND title_group IS NOT NULL
GROUP BY title_group
Upvotes: 0