Sample data below. I want to clean up data based on the next non-null value of the same id, based on row (actually a timestamp). I can't do lag, because in some cases there are consecutive nulls. I can't do coalesce(a.col_a, (select min(b.col_a) from table b where a.id=b.id)) because it will return an "outdated" value (eg NYC instead of SF in col_a row 4). (I can do this, once I've accounted for everything else, for the cases where i have no next non-null value, like col_b row 9/10, to just fill in the last). The only thing I can think of is to do table_x as (select id, col_x from table where col_a is not null) for each column, and then join taking the minimum where id = id and table_x.row > table.row. But I have a handful of columns and that feels cumbersome and inefficient. Appreciate any help! row id col_a col_a_desired col_b col_b_desired 0 1 - NYC red red 1 1 NYC NYC red red 2 1 SF SF - blue 3 1 - SF - blue 4 1 SF SF blue blue 5 2 PAR PAR red red 6 2 LON LON - blue 7 2 LON LON - blue 8 2 - LON blue blue 9 2 LON LON - blue 10 2 - LON - blue

postgresqlgoogle-cloud-platformgoogle-bigquery

juicero

Reputation: 78

Get the next (or previous) non-null value in multiple partitioned

Sample data below.

I want to clean up data based on the next non-null value of the same id, based on row (actually a timestamp).

I can't do lag, because in some cases there are consecutive nulls.
I can't do coalesce(a.col_a, (select min(b.col_a) from table b where a.id=b.id)) because it will return an "outdated" value (eg NYC instead of SF in col_a row 4). (I can do this, once I've accounted for everything else, for the cases where i have no next non-null value, like col_b row 9/10, to just fill in the last).

The only thing I can think of is to do

table_x as (select id, col_x from table where col_a is not null)

for each column, and then join taking the minimum where id = id and table_x.row > table.row. But I have a handful of columns and that feels cumbersome and inefficient.

Appreciate any help!

row	id	col_a	col_a_desired	col_b	col_b_desired
0	1	-	NYC	red	red
1	1	NYC	NYC	red	red
2	1	SF	SF	-	blue
3	1	-	SF	-	blue
4	1	SF	SF	blue	blue
5	2	PAR	PAR	red	red
6	2	LON	LON	-	blue
7	2	LON	LON	-	blue
8	2	-	LON	blue	blue
9	2	LON	LON	-	blue
10	2	-	LON	-	blue

Upvotes: 1

Answers (2)

Nestor

Reputation: 1377

Can you try this query?

WITH samp AS (
  SELECT 0 row_id, 1 id, null col_a, 'red' col_b UNION ALL
  SELECT 1, 1, 'NYC', 'red' UNION ALL
  SELECT 2, 1, 'SF', NULL UNION ALL
  SELECT 3, 1, NULL, NULL UNION ALL
  SELECT 4, 1, 'SF', 'blue' UNION ALL
  SELECT 5, 2, 'PAR', 'red' UNION ALL
  SELECT 6, 2, 'LON', NULL UNION ALL
  SELECT 7, 2, 'LON', NULL UNION ALL
  SELECT 8, 2, NULL, 'blue' UNION ALL
  SELECT 9, 2, 'LON', NULL UNION ALL
  SELECT 10, 2, NULL, NULL
)
  SELECT
  row_id,
  id,
  IFNULL(FIRST_VALUE(col_a IGNORE NULLS) 
    OVER (PARTITION BY id ORDER BY row_id
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
  FIRST_VALUE(col_a IGNORE NULLS) 
    OVER (PARTITION BY id ORDER BY row_id desc
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_a,
  IFNULL(FIRST_VALUE(col_b IGNORE NULLS) 
    OVER (PARTITION BY id ORDER BY row_id
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
  FIRST_VALUE(col_b IGNORE NULLS) 
    OVER (PARTITION BY id ORDER BY row_id desc
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_b
 from samp order by id, row_id

Output:

References: https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls

Upvotes: 2

bobflux

Reputation: 11581

I want to clean up data based on the next non-null value.

So if you reverse the order, that's the last non-null value.

If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in plpgsql instead, or even use the script language of your choice (but that will be slower).

The idea is to open a cursor for update, with an ORDER BY in the reverse order mentioned in the question. Then the plpgsql code stores the last non-null values in variables, and if needed issues an UPDATE WHERE CURRENT OF cursor to replace the nulls in the table with desired values.

This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks using the "id" column as chunk identifier, so it would be a good idea to use that.

Upvotes: 1

Get the next (or previous) non-null value in multiple partitioned

Answers (2)

Related Questions