Reputation: 191
I have records with date and state. I want to display only the records which states have changed from the previous record.
Example input
date state
2013-01-15 04:15:07.602 ON
2013-01-15 05:15:08.502 ON
2013-01-15 06:15:08.502 OFF
2013-01-15 07:15:08.502 ON
2013-01-15 08:15:08.502 ON
...
Output expected
date state
2013-01-15 04:15:07.602 ON
2013-01-15 06:15:08.502 OFF
2013-01-15 07:15:08.502 ON
I wrote a simple udf
function (tmchange in my query) which returns the value if state has changed and null otherwise.
select t.date, t.state from (SELECT date, state, tmchange(LAG(state, 1)
over (PARTITION BY month(date) ORDER BY date), state ) as lagres
from demo_table where state is not null
and year=2013) t where t.lagres is not null ;
I use the LAG
function from hive windowing functions. The problem is I get error: Java heap space
when I use PARTITION by month(date)
. When I try PARTITION by weekofyear(date)
it works but it takes really too much time probably due to the LAG
function.
Do you have any idea how I can optimize my query in order to meet this needs (more generally compare a record to a previous one). Is it really the LAG
function to use in this case ?
Upvotes: 2
Views: 18525
Reputation: 3051
Not sure you can make it go any faster unless you reduce the partition size. Can you just partition by the date instead of month or week: PARTITION BY month(date), day(date)
Why use a UDF? Wouldn't this work:
SELECT t.date, t.state
FROM (
SELECT
date,
state,
LAG(state) OVER (PARTITION BY month(date) ORDER BY date) AS prev_state
FROM demo_table
WHERE state IS NOT NULL and year=2013
) t
WHERE t.prev_state != t.state ;
Upvotes: 4