DarioB
DarioB

Reputation: 1609

how to remove duplicate rows when using array_agg in bigquery

I have the following data:

WITH data as (
  SELECT 18 AS value, 1 AS id, "A" AS other_value,
  UNION ALL SELECT 20 AS value, 1 AS id, "B",
  UNION ALL SELECT 22 AS value, 2 AS id, "C"
  UNION ALL SELECT 30 AS value, 3 AS id, "A"
  UNION ALL SELECT 37 AS value, 2 AS id, "B"
  UNION ALL SELECT 31 AS value, 2 AS id, "C"
  UNION ALL SELECT 42 AS value, 1 AS id, "D"
)

I am using the following query

select
   FIRST_VALUE(id) over w1 as id
 , ARRAY_AGG(value) over w1  as data
 , FIRST_VALUE(other_value) over w1 as first_other_data
 , LAST_VALUE(other_value) over w1  as last_other_data
from data
WINDOW w1 as (PARTITION BY id order by value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

And I get

id  data  first_other_data  last_other_data 
1   18    A                 D 
    20
    42
1   18    A                 D
    20
    42
1   18    A                 D
    20
    42
2   22    C                 B
    31
    37
2   22    C                 B
    31
    37
2   22    C                 B
    31
    37
3   30    A                 A

But i am getting duplicates that I don't want. I was thinking to use distinct keyword, but bigquery do not like it. My expected result is :

id  data  first_other_data  last_other_data     
1   18     A                 D
    20
    42
2   22     C                 B
    31
    37
3   30     A                 A

I have found similar questions but not exactly this case. Thanks EDIT: In my attempt to simplify the scenario for this SO question I took out some essential components. I have modified this with a more accurate version of my problem.

Upvotes: 2

Views: 1952

Answers (2)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

The easiest way to achieve your goal without changing your initial query is to wrap it with extra select as in below example

#standardSQL
SELECT ANY_VALUE(t).* FROM (
  SELECT
     FIRST_VALUE(id) OVER w1 AS id
   , ARRAY_AGG(value) OVER w1  AS data
   , FIRST_VALUE(other_value) OVER w1 AS first_other_data
   , LAST_VALUE(other_value) OVER w1  AS last_other_data
  FROM data
  WINDOW w1 AS (PARTITION BY id ORDER BY value ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) t
GROUP BY FORMAT('%t', t)  

with result

enter image description here

Upvotes: 2

Martin Weitzmann
Martin Weitzmann

Reputation: 4736

I think you still want an aggregation with grouping :) Since value and other_value seem to be related (you order by value in the window to select other_value) I'd simply put them into the array, too:

WITH data as (
  SELECT 18 AS value, 1 AS id, "A" AS other_value,
  UNION ALL SELECT 20 AS value, 1 AS id, "B",
  UNION ALL SELECT 22 AS value, 2 AS id, "C"
  UNION ALL SELECT 30 AS value, 3 AS id, "A"
  UNION ALL SELECT 37 AS value, 2 AS id, "B"
  UNION ALL SELECT 31 AS value, 2 AS id, "C"
  UNION ALL SELECT 42 AS value, 1 AS id, "D"
)

select 
  id,
  array_agg(struct(value, other_value)) as data
from data
group by id

So, if you need it later you can write a subquery to get it - or you can add another step to do in one query with the aggregation in between:

WITH data as (
  SELECT 18 AS value, 1 AS id, "A" AS other_value,
  UNION ALL SELECT 20 AS value, 1 AS id, "B",
  UNION ALL SELECT 22 AS value, 2 AS id, "C"
  UNION ALL SELECT 30 AS value, 3 AS id, "A"
  UNION ALL SELECT 37 AS value, 2 AS id, "B"
  UNION ALL SELECT 31 AS value, 2 AS id, "C"
  UNION ALL SELECT 42 AS value, 1 AS id, "D"
), 
temp as (
  select 
    id,
    array_agg(struct(value, other_value)) as data
  from data
  group by id
)
select 
  id,
  array(select value from unnest(data)) as data,
  (select other_value from unnest(data) order by value ASC limit 1) first_other_data,
  (select other_value from unnest(data) order by value DESC limit 1) last_other_data
from temp

Upvotes: 3

Related Questions