KayEss
KayEss

Reputation: 419

Aggregate same row groups into one row

I have a table with 30 columns. All but two column (CropVariety, Date) values in the same row groups are the same. I want to aggregate groups of rows with the same column values (except CropVariety, Date) into one row, stack different column values into one cell.

Exapmple:

ID      Field      Year IDFarm Farm   Type         CropVariety  Date
1222980 MU-9 - EKO 2020 219255  MU    Cover Crop   Rice         2020-04-03
1222980 MU-9 - EKO 2020 219255  MU    Cover Crop   Soy          2020-04-07
3333333 AB-9 - EKO 2020 444444  AB    Cover Crop   Lentil       2020-05-03
3333333 AB-9 - EKO 2020 444444  AB    Cover Crop   Hemp         2020-05-07


Desired output:

ID      Field      Year IDFarm Farm   Type         CropVariety  Date
1222980 MU-9 - EKO 2020 219255  MU    Cover Crop   Rice, Soy    2020-04-03, 2020-04-07
3333333 AB-9 - EKO 2020 444444  AB    Cover Crop   Lentil, Hemp 2020-05-03, 2020-05-07

I assume i should use ARRAY_AGG or ARRAY_CONCAT_AGG function, but my formulas were always wrong so i don't know really how to use is.

Upvotes: 0

Views: 336

Answers (2)

Yun Zhang
Yun Zhang

Reputation: 5503

Update: check Mikhail's answer for the syntax to specify columns in batch.

======= You said ARRAY_AGG but your expected results look like a string. You can use query below (and replace STRING_AGG() with ARRAY_AGG() if you do want array)

WITH data AS (
    SELECT 1222980 ID, 'MU-9 - EKO' Field, 2020 Year, "Rice" CropVariety, DATE('2020-04-03') Date
    UNION ALL
    SELECT 1222980 ID, 'MU-9 - EKO' Field, 2020 Year, "Soy" CropVariety, DATE('2020-04-07') Date
    UNION ALL
    SELECT 3333333 ID, 'AB-9 - EKO' Field, 2020 Year, "Lentil" CropVariety, DATE('2020-05-03') Date
    UNION ALL
    SELECT 3333333 ID, 'AB-9 - EKO' Field, 2020 Year, "Hemp" CropVariety, DATE('2020-05-07') Date
)
SELECT ID, Field, Year, string_agg(CropVariety, ', '), string_agg(CAST(Date AS STRING), ', ')
FROM data
GROUP BY 1,2,3

Output:

+---------+------------+------+--------------+------------------------+
|   ID    |   Field    | Year |     f0_      |          f1_           |
+---------+------------+------+--------------+------------------------+
| 1222980 | MU-9 - EKO | 2020 | Rice, Soy    | 2020-04-03, 2020-04-07 |
| 3333333 | AB-9 - EKO | 2020 | Lentil, Hemp | 2020-05-03, 2020-05-07 |
+---------+------------+------+--------------+------------------------+

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173003

I have a table with 30 columns.

Typing 30 (or in many practical case even more) column names in SELECT list and GROUP BY clause is always a pain and source for typos, etc.

All but two column (CropVariety, Date) values in the same row groups are the same.

Obviously, would be great to use only those two column names in query

So, below is generic solution for such cases (BigQuery Standard SQL)

#standardSQL
SELECT ANY_VALUE(t).* EXCEPT(CropVariety, Date), 
  STRING_AGG(CropVariety, ', ') AS CropVariety, 
  STRING_AGG(CAST(Date AS STRING), ', ') AS Date
FROM `project.dataset.table` t
GROUP BY TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(CropVariety, Date) FROM UNNEST([t])))

As you can see here - only CropVariety and Date column names are used explicitly - the rest is taken care by code

If to apply to sample data from your question - output is

Row ID          Field           Year    CropVariety     Date     
1   1222980     MU-9 - EKO      2020    Rice, Soy       2020-04-03, 2020-04-07   
2   3333333     AB-9 - EKO      2020    Lentil, Hemp    2020-05-03, 2020-05-07  

Upvotes: 2

Related Questions