Reputation: 78
Sample data below.
I want to clean up data based on the next non-null value of the same id, based on row (actually a timestamp).
The only thing I can think of is to do
table_x as (select id, col_x from table where col_a is not null)
for each column, and then join taking the minimum where id = id and table_x.row > table.row. But I have a handful of columns and that feels cumbersome and inefficient.
Appreciate any help!
row | id | col_a | col_a_desired | col_b | col_b_desired |
---|---|---|---|---|---|
0 | 1 | - | NYC | red | red |
1 | 1 | NYC | NYC | red | red |
2 | 1 | SF | SF | - | blue |
3 | 1 | - | SF | - | blue |
4 | 1 | SF | SF | blue | blue |
5 | 2 | PAR | PAR | red | red |
6 | 2 | LON | LON | - | blue |
7 | 2 | LON | LON | - | blue |
8 | 2 | - | LON | blue | blue |
9 | 2 | LON | LON | - | blue |
10 | 2 | - | LON | - | blue |
Upvotes: 1
Views: 1000
Reputation: 1377
Can you try this query?
WITH samp AS (
SELECT 0 row_id, 1 id, null col_a, 'red' col_b UNION ALL
SELECT 1, 1, 'NYC', 'red' UNION ALL
SELECT 2, 1, 'SF', NULL UNION ALL
SELECT 3, 1, NULL, NULL UNION ALL
SELECT 4, 1, 'SF', 'blue' UNION ALL
SELECT 5, 2, 'PAR', 'red' UNION ALL
SELECT 6, 2, 'LON', NULL UNION ALL
SELECT 7, 2, 'LON', NULL UNION ALL
SELECT 8, 2, NULL, 'blue' UNION ALL
SELECT 9, 2, 'LON', NULL UNION ALL
SELECT 10, 2, NULL, NULL
)
SELECT
row_id,
id,
IFNULL(FIRST_VALUE(col_a IGNORE NULLS)
OVER (PARTITION BY id ORDER BY row_id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
FIRST_VALUE(col_a IGNORE NULLS)
OVER (PARTITION BY id ORDER BY row_id desc
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_a,
IFNULL(FIRST_VALUE(col_b IGNORE NULLS)
OVER (PARTITION BY id ORDER BY row_id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
FIRST_VALUE(col_b IGNORE NULLS)
OVER (PARTITION BY id ORDER BY row_id desc
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_b
from samp order by id, row_id
References: https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls
Upvotes: 2
Reputation: 11581
I want to clean up data based on the next non-null value.
So if you reverse the order, that's the last non-null value.
If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in plpgsql instead, or even use the script language of your choice (but that will be slower).
The idea is to open a cursor for update, with an ORDER BY in the reverse order mentioned in the question. Then the plpgsql code stores the last non-null values in variables, and if needed issues an UPDATE WHERE CURRENT OF cursor to replace the nulls in the table with desired values.
This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks using the "id" column as chunk identifier, so it would be a good idea to use that.
Upvotes: 1