Huzefa
Huzefa

Reputation: 119

HIVE: Finding running totals

I have a Table called Program which have following columns:

ProgDate(Date)
Episode(String)
Impression_id(int)
ProgName(String)

I want to find out total impressions for each date and episode for which i have the below query which is working fine

Select progdate, episode, count(distinct impression_id) Impression from Program where progname='BBC' group by progdate, episode order by progdate, episode;
Result:
ProgDate        Episode     Impression      
20160919        1       5           
20160920        1       15          
20160921        1       10          
20160922        1       5           
20160923        2       25          
20160924        2       10          
20160925        2       25          

But I also want to find out the cumulative total for each episode. I tried searching on how to find running total but it is adding up all the previous totals. i want running total for each episode, like below:

Date        Episode     Impression  CumulativeImpressionsPerChannel     
20160919        1       5               5
20160920        1       15              20
20160921        1       10              30
20160922        1       5               35
20160923        2       25              25
20160924        2       10              35
20160925        2       25              60

Upvotes: 0

Views: 5860

Answers (1)

Paul Maxwell
Paul Maxwell

Reputation: 35563

Recent versions of Hive HQL support windowed analytic functions (ref 1) (ref 2) including SUM() OVER()

Assuming you have such a version I have mimicked the syntax using PostgreSQL at SQL Fiddle

CREATE TABLE d
    (ProgDate int, Episode int, Impression int)
;

INSERT INTO d
    (ProgDate, Episode, Impression)
VALUES
    (20160919, 1, 5),
    (20160920, 1, 15),
    (20160921, 1, 10),
    (20160922, 1, 5),
    (20160923, 2, 25),
    (20160924, 2, 10),
    (20160925, 2, 25)
;

Query 1:

select
      ProgDate, Episode, Impression
    , sum(Impression) over(partition by Episode order by ProgDate) CumImpsPerChannel 
    , sum(Impression) over(order by ProgDate) CumOverall
from (
       Select progdate, episode, count(distinct impression_id) Impression 
       from Program 
       where progname='BBC' 
       group by progdate, episode order by progdate, episode
      ) d

Results:

| progdate | episode | impression | cumimpsperchannel |
|----------|---------|------------|-------------------|
| 20160919 |       1 |          5 |                 5 |
| 20160920 |       1 |         15 |                20 |
| 20160921 |       1 |         10 |                30 |
| 20160922 |       1 |          5 |                35 |
| 20160923 |       2 |         25 |                25 |
| 20160924 |       2 |         10 |                35 |
| 20160925 |       2 |         25 |                60 |

Upvotes: 2

Related Questions