Rebecca
Rebecca

Reputation: 77

BigQuery - Remove duplicates from array

Using BigQuery, I want to group pages depending on their title with one query and calculate different metrics on the groups. As the rules on titles are not mutually exclusive, I've done it this way:

SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN
UNNEST([
    CASE WHEN (title LIKE '%game%') 
    THEN 'games_group' END, 
    CASE WHEN (title LIKE '%sport%') 
    THEN 'sports_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
GROUP BY title_group

Here is the result:

views       ...   title_group
3414469869  ... 
4355264     ...   games_group
1361074     ...   sports_group

However, the number 3414469869 for the views of the pages that don't belong to any group is wrong. Indeed, when a title doesn't contain "game" (or "sport"), we get UNNEST([null, "sports_group"]) (or UNNEST(["games_group", null])) so we still count the views for the null group. When a title doesn't contain "game" neither "sport", the views are even counted twice.

Is there a way to remove duplicates from the array ?

Upvotes: 1

Views: 1055

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269773

How about adding another group?

SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` CROSS JOIN
     UNNEST([CASE WHEN title LIKE '%game%' THEN 'games_group' END, 
             CASE WHEN title LIKE '%sport%' THEN 'sports_group' END,
             CASE WHEN title NOT LIKE '%game%' AND title NOT LIKE '%sport%' THEN 'Neither' END
            ]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en' AND
      title_group IS NOT NULL
GROUP BY title_group;

Note: This doesn't take NULL titles into account. I don't know if that is important.

However, I would phrase this using two columns:

SELECT (title LIKE '%game%') as is_game,
       (title LIKE '%sport%') as is_sport,
       SUM(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en' AND
      title_group IS NOT NULL
GROUP BY is_game, is_sport;

This doesn't return the same rows as yours -- games and sports are split across two rows. But you can see the combinations.

EDIT:

Now that I think of this, you just want a LEFT JOIN:

SELECT g.title_group, SUM(pv.views) as views, 
FROM `fh-bigquery.wikipedia_v3.pageviews_2019` pv LEFT JOIN
     (SELECT '%game%' as pattern, 'games_group' as title_group UNION ALL
      SELECT '%sport%', 'sports_group' as title_group UNION ALL
     ) g
     ON pv.title LIKE g.pattern
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
      wiki = 'en' AND
GROUP BY g.title_group;

Upvotes: 2

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below is for BigQuery Standard SQL

#standardSQL
SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`,
UNNEST(
    CASE WHEN REGEXP_CONTAINS(title, r'game|sport') THEN 
      [
        CASE WHEN (title LIKE '%game%') THEN 'games_group' END,
        CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END
      ]
      ELSE ['other']
    END
) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10'AND wiki='en'
AND   title_group IS NOT NULL
GROUP BY title_group

Upvotes: 0

Related Questions