Reputation: 25
I'm trying to find the top 3 most popular locations with the greatest tripCount.
So I need to see the total of tripCount
per location and return the greatest n...
My data is as follow:
LocationID tripCount tripDistance
101 40 4.6
203 29 1.3
56 25 9.3
101 17 4.5
66 5 1.1
13 5 0.5
203 10 1.2
558 8 0.5
56 10 5.5
So the result I'm expecting is:
101 57
203 39
56 35
So far my code is:
B = GROUP UNION_DATA BY DOLocationID;
C = FOREACH B {
DA = ORDER UNION_DATA BY passenger_count DESC;
DB = LIMIT DA 5;
GENERATE FLATTEN(group), FLATTEN(DB.LocationID), FLATTEN(DB.dropoff_datetime);
}
What am I missing and what do I need to do to get the expected result?
Upvotes: 0
Views: 134
Reputation: 403
Below piece of code should get you desired results. I broke down the statement into simple chunks for better understanding and readability.Also your alias and code provided seems incomplete so i completely re-wrote from scratch.
LocationID,tripCount,tripDistance
cat > trip_data.txt
101,40,4.6
203,29,1.3
56,25,9.3
101,17,4.5
66,5,1.1
13,5,0.5
203,10,1.2
558,8,0.5
56,10,5.5
PIG Code:
A = load '/home/ec2-user/trip_data.txt' using PigStorage(',') as (LocationID,tripCount,tripDistance);
describe A;
B = GROUP A BY LocationID;
describe B;
dump B;
C = FOREACH B GENERATE group, SUM(A.tripCount);
describe C;
dump C;
D = ORDER C BY $1 DESC;
describe D;
dump D;
RESULT = LIMIT D 3;
describe RESULT;
dump RESULT;
Upvotes: 1