Mario M.
Mario M.

Reputation: 872

BigQuery Standard Get first not null value when grouping

I have a table like this:

CUSTOMERS_ID  DATE_SALES  DIMENSION
MARIO1        20200201    NULL
MARIO1        20200113    Spain
MARIO2        20200131    NULL
MARIO3        20200101    France
MARIO3        20191231    Spain

and I need to order by CUSTOMERS_ID and DATE_SALES DESC fields. Then I want to group by CUSTOMERS_ID field and get first not null value of DIMENSION field. The output table would be:

CUSTOMERS_ID  DIMENSION
MARIO1        Spain
MARIO2        NULL
MARIO3        France

Any ideas? I have tried COALESCE functions, FIRST_VALUE, and I have not the results I had expected.

Thanks in advance!

Upvotes: 7

Views: 4180

Answers (3)

Sabri Karagönen
Sabri Karagönen

Reputation: 2365

You can group by customer id and use ARRAY_AGG by ignoring NULLS, and you can also order by date in that field. Limit 1 will make it more efficient by using less RAM storage. And then, OFFSET(0) will make it an unnested field, so you can use that field easily.

WITH 
raw_data AS
(
  SELECT 'MARIO1' CUSTOMERS_ID, 20200201 DATE_SALES, NULL as DIMENSION UNION ALL
  SELECT 'MARIO1' CUSTOMERS_ID, 20200113 DATE_SALES, 'Spain' as DIMENSION UNION ALL
  SELECT 'MARIO2' CUSTOMERS_ID, 20200131 DATE_SALES, NULL as DIMENSION UNION ALL
  SELECT 'MARIO3' CUSTOMERS_ID, 20200101 DATE_SALES, 'France' as DIMENSION UNION ALL
  SELECT 'MARIO3' CUSTOMERS_ID, 20191231 DATE_SALES, 'Spain' as DIMENSION
)
SELECT CUSTOMERS_ID, ARRAY_AGG(DIMENSION IGNORE NULLS ORDER BY DATE_SALES DESC LIMIT 1)[OFFSET(0)] as DIMENSION
FROM raw_data
GROUP BY 1

Upvotes: 10

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173190

Below is for BigQuery Standard SQL

#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY IF(DIMENSION IS NULL, NULL, DATE_SALES) DESC LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY CUSTOMERS_ID   

if to apply to sample data from your question - result is

Row CUSTOMERS_ID    DATE_SALES  DIMENSION    
1   MARIO1          20200113    Spain    
2   MARIO2          20200131    null     
3   MARIO3          20200101    France   

Upvotes: 3

Tim Biegeleisen
Tim Biegeleisen

Reputation: 522732

We can just use a ROW_NUMBER trick here:

WITH cte AS (
SELECT CUSTOMERS_ID,
       ROW_NUMBER() OVER (PARTITION BY CUSTOMERS_ID
                          ORDER BY -1.0*UNIX_SECONDS(DATE_SALES) DESC) rn
    FROM yourTable
)

SELECT CUSTOMERS_ID, DIMENSION
FROM cte
WHERE rn = 1
ORDER BY CUSTOMERS_ID;

The logic is to order row number descending by the negative number of seconds since the epoch. This places more recent sales first, and also places NULL at the very end, so a NULL value would only receive row number 1 in the event that no non NULL dimension data be available.

Upvotes: 1

Related Questions