zac s

Reputation: 23

How to Query Data Associated With Minimum/Maximum in Pig

I'm looking for the coldest hour for each day. My data looks like this:

(2015/12/27,12AM,32.0)
(2015/12/27,12PM,34.0)
(2015/12/28,10AM,26.1)
(2015/12/28,10PM,28.0)
(2015/12/28,11AM,27.0)
(2015/12/28,11PM,28.9)
(2015/12/28,12AM,25.0)
(2015/12/28,12PM,26.100000000000005)
(2015/12/29,10AM,22.45)
(2015/12/29,10PM,26.1)
(2015/12/29,11AM,24.1)
(2015/12/29,11PM,25.0)
(2015/12/29,12AM,28.9)

I grouped on each day to find the Min Temp with this code:

minTemps = FOREACH gdate2 GENERATE group as day,MIN(removeDash.temp) as minTemp;

which gives this output:

(2015/12/18,17.1)
(2015/12/19,12.9)
(2015/12/20,23.0)
(2015/12/21,32.0)
(2015/12/22,30.899999999999995)
(2015/12/23,36.05)
(2015/12/24,30.45)
(2015/12/25,26.55)
(2015/12/26,28.899999999999995)
(2015/12/27,26.1)
(2015/12/28,23.55)
(2015/12/29,21.0)

My problem: I also need the hour at which the minimum temp occurred. How can I get the hour as well?

Upvotes: 0

Views: 51

Answers (2)

savagedata

Reputation: 722

If I'm understanding your question correctly, grouping by (day, hour) won't work: it yields the minimum temperature for each (day, hour) pair, not the coldest hour and its temperature for each day.

Instead, use a nested foreach:

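-- A is the relation with fields (day, hour, temp)
B = GROUP A BY day;
C = FOREACH B {
    -- sort this day's records from coldest to warmest
    orderd = ORDER A BY temp ASC;
    -- keep only the coldest record
    limitd = LIMIT orderd 1;
    GENERATE FLATTEN(limitd) AS (day, hour, temp);
};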

Group by day as you did before, then order all the hours within the same day by temperature and select only the top record. Just be aware that if there is a tie between two or more hours, only one of these hours will be selected.
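
If you need every tied hour rather than just one, a possible alternative is to join the per-day minimum back onto the original records. This is only a sketch: it assumes the relation from the question, removeDash, has the fields (day, hour, temp) and that temp is declared as a double so it has the same type as the result of MIN. The aliases joined and coldest are made up for illustration.

```pig
-- Assumption: removeDash has fields (day, hour, temp), with temp a double.
gdate2   = GROUP removeDash BY day;
minTemps = FOREACH gdate2 GENERATE group AS day, MIN(removeDash.temp) AS minTemp;

-- Join each record to its day's minimum; every hour that ties for the minimum matches.
joined  = JOIN removeDash BY (day, temp), minTemps BY (day, minTemp);

-- Keep only the columns from the original records.
coldest = FOREACH joined GENERATE removeDash::day  AS day,
                                  removeDash::hour AS hour,
                                  removeDash::temp AS temp;
```

The exact comparison on temp should be safe here because MIN returns one of the input values unchanged. The trade-off is an extra join compared with the nested foreach.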

Upvotes: 1

nobody

Reputation: 11080

Yes, you are on the right track. Modify your group statement to group by day and hour. Finally, use FLATTEN on your group to decouple the keys.

gdate2 = GROUP removeDash by (day,hour);
minTemps = FOREACH gdate2 GENERATE FLATTEN(group) as (day,hour),MIN(removeDash.temp) as minTemp;

Upvotes: 0
