roqds
roqds

Reputation: 191

Hive Difference between 2 records

I have records with date and state. I want to display only the records which states have changed from the previous record.

Example input

date                  state
2013-01-15 04:15:07.602 ON
2013-01-15 05:15:08.502 ON
2013-01-15 06:15:08.502 OFF
2013-01-15 07:15:08.502 ON
2013-01-15 08:15:08.502 ON
...

Output expected

date                    state
2013-01-15 04:15:07.602 ON
2013-01-15 06:15:08.502 OFF
2013-01-15 07:15:08.502 ON

I wrote a simple udf function (tmchange in my query) which returns the value if state has changed and null otherwise.

select t.date, t.state from (SELECT date,  state, tmchange(LAG(state, 1) 
over (PARTITION BY month(date) ORDER BY date), state ) as lagres  
from demo_table where state is not null  
and  year=2013) t where t.lagres is not null ;

I use the LAG function from hive windowing functions. The problem is I get error: Java heap space when I use PARTITION by month(date) . When I try PARTITION by weekofyear(date) it works but it takes really too much time probably due to the LAG function. Do you have any idea how I can optimize my query in order to meet this needs (more generally compare a record to a previous one). Is it really the LAG function to use in this case ?

Upvotes: 2

Views: 18525

Answers (1)

swdev
swdev

Reputation: 3051

Not sure you can make it go any faster unless you reduce the partition size. Can you just partition by the date instead of month or week: PARTITION BY month(date), day(date)

Why use a UDF? Wouldn't this work:

SELECT t.date, t.state 
FROM (
  SELECT
    date, 
    state, 
    LAG(state) OVER (PARTITION BY month(date) ORDER BY date) AS prev_state
  FROM demo_table 
  WHERE state IS NOT NULL and year=2013
) t 
WHERE t.prev_state != t.state ;

Upvotes: 4

Related Questions