how to remove duplicate rows when using array_agg in bigquery

Question

I have the following data:

WITH data as (
  SELECT 18 AS value, 1 AS id, "A" AS other_value,
  UNION ALL SELECT 20 AS value, 1 AS id, "B",
  UNION ALL SELECT 22 AS value, 2 AS id, "C"
  UNION ALL SELECT 30 AS value, 3 AS id, "A"
  UNION ALL SELECT 37 AS value, 2 AS id, "B"
  UNION ALL SELECT 31 AS value, 2 AS id, "C"
  UNION ALL SELECT 42 AS value, 1 AS id, "D"
)

I am using the following query

select
   FIRST_VALUE(id) over w1 as id
 , ARRAY_AGG(value) over w1  as data
 , FIRST_VALUE(other_value) over w1 as first_other_data
 , LAST_VALUE(other_value) over w1  as last_other_data
from data
WINDOW w1 as (PARTITION BY id order by value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

And I get

id  data  first_other_data  last_other_data 
1   18    A                 D 
    20
    42
1   18    A                 D
    20
    42
1   18    A                 D
    20
    42
2   22    C                 B
    31
    37
2   22    C                 B
    31
    37
2   22    C                 B
    31
    37
3   30    A                 A

But i am getting duplicates that I don't want. I was thinking to use distinct keyword, but bigquery do not like it. My expected result is :

id  data  first_other_data  last_other_data     
1   18     A                 D
    20
    42
2   22     C                 B
    31
    37
3   30     A                 A

I have found similar questions but not exactly this case. Thanks EDIT: In my attempt to simplify the scenario for this SO question I took out some essential components. I have modified this with a more accurate version of my problem.

Mikhail Berlyant · Accepted Answer

The easiest way to achieve your goal without changing your initial query is to wrap it with extra select as in below example

#standardSQL
SELECT ANY_VALUE(t).* FROM (
  SELECT
     FIRST_VALUE(id) OVER w1 AS id
   , ARRAY_AGG(value) OVER w1  AS data
   , FIRST_VALUE(other_value) OVER w1 AS first_other_data
   , LAST_VALUE(other_value) OVER w1  AS last_other_data
  FROM data
  WINDOW w1 AS (PARTITION BY id ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) t
GROUP BY FORMAT('%t', t)

with result

how to remove duplicate rows when using array_agg in bigquery

Answers (2)

Related Questions