tu_1329

Reputation: 25

Find the top 5 most popular locations based on a sum in a Pig script

I'm trying to find the top 3 most popular locations with the greatest tripCount. So I need to see the total of tripCount per location and return the greatest n...

My data is as follows:

LocationID  tripCount  tripDistance
101            40           4.6
203            29           1.3
56             25           9.3
101            17           4.5
66             5            1.1
13             5            0.5
203            10           1.2
558            8            0.5
56             10           5.5 

So the result I'm expecting is:

 101     57 
 203     39
 56      35

So far my code is:

B = GROUP UNION_DATA BY DOLocationID;
C = FOREACH B {                          
DA = ORDER UNION_DATA BY passenger_count DESC;                
DB = LIMIT DA 5;                         
GENERATE FLATTEN(group), FLATTEN(DB.LocationID), FLATTEN(DB.dropoff_datetime);
}

What am I missing and what do I need to do to get the expected result?

Upvotes: 0

Views: 134

Answers (1)

yammanuruarun

Reputation: 403

The code below should give you the desired results. I broke the statement down into simple steps for readability. The aliases and code in your question seem incomplete, so I rewrote it from scratch.

Sample input file (columns: LocationID,tripCount,tripDistance):

    cat > trip_data.txt
    101,40,4.6
    203,29,1.3
    56,25,9.3
    101,17,4.5
    66,5,1.1
    13,5,0.5
    203,10,1.2
    558,8,0.5
    56,10,5.5

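Pig code:

    -- Declare field types; untyped fields default to bytearray, which SUM would treat as double
    A = load '/home/ec2-user/trip_data.txt' using PigStorage(',') as (LocationID:int, tripCount:int, tripDistance:double);
    describe A;
    -- Collect all rows for each location into one bag
    B = GROUP A BY LocationID;
    describe B;
    dump B;
    -- Total tripCount per location
    C = FOREACH B GENERATE group, SUM(A.tripCount);
    describe C;
    dump C;
    -- Sort by the total ($1), largest first
    D = ORDER C BY $1 DESC;
    describe D;
    dump D;
    -- Keep the top 3 locations
    RESULT = LIMIT D 3;
    describe RESULT;
    dump RESULT;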
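
As for what the attempt in the question was missing: an ORDER and LIMIT nested inside a FOREACH only sort and trim the rows within each group. The groups are never compared with each other, and no SUM is computed. The ranking has to happen after the per-location totals exist. Here is a compact sketch of the same pipeline with named fields. The relative path and the alias names (trips, by_loc, totals, ranked, top3) are assumptions for illustration, not taken from your script.

```pig
-- Sketch only: path and alias names are hypothetical.
trips = LOAD 'trip_data.txt' USING PigStorage(',')
        AS (LocationID:int, tripCount:int, tripDistance:double);

-- One bag of rows per location
by_loc = GROUP trips BY LocationID;

-- One row per location carrying its summed tripCount
totals = FOREACH by_loc GENERATE group AS LocationID,
                                 SUM(trips.tripCount) AS totalTrips;

-- Sort across all locations (not inside each group), then keep the first three
ranked = ORDER totals BY totalTrips DESC;
top3 = LIMIT ranked 3;

DUMP top3;
```

With the sample data this should give the three locations and totals listed as the expected result in the question. Naming the sum (totalTrips) avoids positional references such as $1 in the ORDER step.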

Upvotes: 1
