Reputation: 899
I have 4 columns
date number Estimate Client
---- ------
1 3 10 A
2 NULL 10 Null
3 5 10 A
4 NULL 10 Null
5 NULL 10 Null
6 2 10 A
.......
I need to replace the NULL values with new values takes on the value from the last known value in the previous date in the date column eg: date=2 number = 3, date 4 and 5 number = 5 and 5. The NULL values appear randomly.
This needs to be done in Hive.
Upvotes: 2
Views: 9353
Reputation: 1518
if you are using SQL then below query cann help. Else you can use pandas ffill and bfill function .
select primary_key_val,country,
COALESCE(country, LAST_VALUE(country, TRUE) OVER(partition by primary_key_val order **by eff_start_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) as upd_country,**
eff_start_dt from dim_acct_keys order by primary_key_val,eff_start_dt
Some Data for example:
+------------------+----------+--------------+---------------+
| primary_key_val | country | upd_country | eff_start_dt |
+------------------+----------+--------------+---------------+
| act1010 | USA | USA | 20190101 |
| act1010 | NULL | USA | 20190102 |
| act1010 | NULL | USA | 20190103 |
| act1012 | USA | USA | 20190101 |
| act1012 | NULL | USA | 20190102 |
| act1012 | MEX | MEX | 20190103 |
Upvotes: 1
Reputation: 4682
This about sliding window;
Here is my table content;
hive> select * from my_table;
OK
1 3 10 A
2 NULL 10 NULL
3 5 10 A
4 NULL 10 NULL
5 NULL 10 NULL
6 2 10 A
Time taken: 0.06 seconds, Fetched: 6 row(s)
All you need to do is to slide over a window between preceedings and current row and find most recent not null value. LAST_VALUE
windowable function has an argument to ignore null values as boolean. LAST_VALUE(<field>,<ignore_nulls> as boolean)
;
SELECT
COALESCE(`date`, LAST_VALUE(`date`, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
COALESCE(number, LAST_VALUE(number, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
COALESCE(estimate, LAST_VALUE(estimate, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
COALESCE(client, LAST_VALUE(client, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
FROM my_table;
The result will be;
OK
1 3 10 A
2 3 10 A
3 5 10 A
4 5 10 A
5 5 10 A
6 2 10 A
Time taken: 19.177 seconds, Fetched: 6 row(s)
Upvotes: 8
Reputation: 3897
Here is a solution using standard hiveql joins. This should work on all versions of hive. Table c merges closest date of the client that is not null. Later table d merges on the number associated with that date. Using coalesce the appended value is only used when the number is null.
select c.date
, coalesce(c.number,d.number) as number
, c.client
, estimate
from
(select date
, max(prior_date) as prior_date -- nearest date not null number
, value
, estimate
, a.client
from
(select date
, value
, estimate
, client
from table_have
) a
left outer join
(select date as prior_date -- dates without nulls
, client
from table_have
where number is not null
) b
on a.client=b.client
where date > prior_dates
group by a.client, date, value
) c
left outer join
(select date
, number
, client
from table_have
where number is not null
) d
on c.client = d.client and c.prior_date=d.date
group by c.date, c.client, estimate
;
This query could be optimized more by using common table expressions similarly as used in the alternate solution. However, this solution does not require N repetitions of lines and should work universally. The number N required in the other solution may not be static as such this solution may work for the more general case.
Upvotes: 1
Reputation: 36545
This is actually a pretty tricky question, because Hive doesn't support recursive CTEs or correslated subqueries, which would be the usual way to solve this type of problem.
The only pure Hive way I can think of is to do a bunch of self joins. You'll have to do as many as maximum length of consecutive nulls in your data.
--add in row numbers
with T as
(select select *, row_number() over (order by date) rn
from mytable)
--main query
select T.date,
case when T.number is not null then T.number
else when T1.number is not null then T1.number
else when T2.number is not null then T2.number end as number
--repeat this N times
--where N is the length of the longest sequece of consectutive nulls
-- add in your other columns here
from T
join T T1 on T1.date = t.date - 1
join T T2 on T2.date = t.date - 2
--repeat this N times
Upvotes: 0