David Webb
David Webb

Reputation: 492

Using Hive Query to Analyze Time Series Data

I have a table in Hive that shows the login time (connect_date) for every user (user_id) of our system. Here is the table structure:

hive> describe access_log;
OK
user_id                 double                  from deserializer
connect_date            timestamp               from deserializer
ip                      string                  from deserializer
logout_date             timestamp               from deserializer
session_id              string                  from deserializer

My goal is to find the avg time between logins for each user over a specific time period, lets use the last 7 days for this question.

How would the Hive Query look to achieve this analysis of the data? Thanks.

Upvotes: 2

Views: 3288

Answers (1)

Jordan Young
Jordan Young

Reputation: 356

There are a few ways of doing this:

1) You could aggregate the connect_dates into an array and write a UDF to parse an array of dates/timestamps into the average time between.

2) You could use the windowing functions introduced in hive 0.12 - this is much simpler

select
    user_id, avg(connect_date-lag_connect_date)
from
    (
    select
        user_id, connect_date, lag(connect_date) over (partition by user_id order by connect_date) as lag_connect_date
    from
        access_log
    ) a
group by
    user_id

Upvotes: 4

Related Questions