Harry Barsegyan
Harry Barsegyan

Reputation: 97

How to Ignore Nulls in Hive Lead Function

I have the following statement in a query which works fine in Presto:

CASE  
    WHEN  cte_v1.id_agent is not null 
         THEN cte_v1.id_agent
    WHEN  cte_v1.id_agent is null
          and lead(cte_v1.interaction_channel,1) ignore nulls over(partition by id_ticket, interaction_channel order by cte_v1.interaction_start_time) = 'messaging' 
        THEN lead(cte_v1.id_agent,1) ignore nulls over(partition by id_ticket, interaction_channel order by cte_v1.interaction_start_time) 
END as agent

however when I try to operationalize this query in Hive, I receive the following error 'FAILED: ParseException line 12:39 missing KW_THEN at 'ignore' near 'nulls' in statement'

As mentioned, this query works fine in Presto and gives the correct result by recreating the id_agent column and replaces any null agent ID with the next non-null value

agent id_agent interaction_start_time interaction_end_time inbound_or_outbound interaction_channel
364920063 364920063 2020-09-16 6:38:47 2020-09-16 6:38:47 Outbound messaging
364920063 NULL 2020-09-16 12:37:54 2020-09-16 12:37:54 Inbound messaging
364920063 364920063 2020-09-17 8:02:07 2020-09-17 8:02:07 Outbound messaging
191918955 191918955 2020-09-17 8:16:36 2020-09-17 8:25:45 Outbound messaging
191918955 NULL 2020-09-17 11:16:47 2020-09-17 11:16:47 Inbound messaging
191918955 NULL 2020-09-17 16:02:35 2020-09-17 16:11:30 Inbound messaging
191918955 191918955 2020-09-20 2:13:14 2020-09-20 2:13:14 Outbound messaging
191918955 191918955 2020-09-20 2:44:40 2020-09-20 2:46:09 Inbound messaging
191918955 191918955 2020-09-20 7:13:40 2020-09-20 7:13:40 Outbound messaging

Upvotes: 1

Views: 1602

Answers (1)

leftjoin
leftjoin

Reputation: 38290

Currently there are two open JIRAs both about adding IGNORE NULLs/RESPECT NULLs to LEAD/LAG: HIVE-18145 and HIVE-24869

But it seems your problem can be easily solved using first_value:

If you want to find next not null value you can use first_value function with second optional parameter TRUE, it means ignore NULLs.

Example (used part of your data example, timestamps fixed to be 24hrs-based to be sortable), you can add partition by and conditions checking, etc to adopt it, this demo is minimal example to show how it works:

with mytable as
(
select 364920063  id_agent ,'2020-09-16 06:38:47' interaction_start_time union all
select NULL        ,'2020-09-16 12:37:54' union all
select 364920063   ,'2020-09-17 08:02:07' union all
select 191918955   ,'2020-09-17 08:16:36' union all
select NULL        ,'2020-09-17 11:16:47' union all
select NULL        ,'2020-09-17 16:02:35' union all
select 191918955   ,'2020-09-20 02:13:14'
)

select id_agent, interaction_start_time,
       --get first not null value in the frame between current row and unbounded following 
       first_value(id_agent,true) over(order by interaction_start_time rows between current row and unbounded following) agent
from mytable

Result:

id_agent    interaction_start_time  agent
364920063   2020-09-16 06:38:47     364920063
NULL        2020-09-16 12:37:54     364920063
364920063   2020-09-17 08:02:07     364920063
191918955   2020-09-17 08:16:36     191918955
NULL        2020-09-17 11:16:47     191918955
NULL        2020-09-17 16:02:35     191918955
191918955   2020-09-20 02:13:14     191918955

Upvotes: 1

Related Questions