Reputation: 837
In this raw data we have info of baseball players, the schema is:
name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]
Using the following script we are able to list out players and the different positions they have played. How do we get a count of how many players have played a particular position? E.G. How many players were in the 'Designated_hitter' position?
A single position can't appear multiple times in position
bag for a player.
Pig Script and output for the sample data is listed below.
--pig script
players = load 'baseball' as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
pos = foreach players generate name, flatten(position) as position;
groupbyposition = group pos by position;dump groupbyposition;
--dump groupbyposition (output of one position i.e Designated_hitter)
(Designated_hitter,{(Michael Young,Designated_hitter)})
Upvotes: 0
Views: 3391
Reputation: 5186
From what I can tell you've already done all of the 'grunt' (Ha!, Pig joke) work. All there it left to do is use COUNT
on the output of the GROUP BY
. Something like:
groupbyposition = group pos by position ;
pos_count = FOREACH groupbyposition GENERATE group AS position, COUNT(pos) ;
Note: Using UDFs you may be able to get a more efficient solution. If you care about counting a certain few fields then it should be more efficient to filter the postion
bag before hand (This is why I said UDF, I forgot you could just use a nested FILTER
). For example:
pos = FOREACH players {
-- you can also add the DISTINCT that alexeipab points out here
-- make sure to change postion in the FILTER to dist!
-- dist = DISTINCT position ;
filt = FILTER postion BY p MATCHES 'Designated_hitter|etc.' ;
GENERATE name, FLATTEN(filt) ;
}
If none of the positions you want appear in postion
then it will create an empty bag. When empty bags are FLATTEN
ed the row is discarded. This means you'll be FLATTEN
ing bags of N or less elements (where N is the number of fields you want) instead of 7-15 (didn't really look at the data that closely), and the GROUP
will be on significantly less data.
Notes: I'm not sure if this will be significantly faster (if at all). Also, using a UDF to preform the nested FILTER
may be faster.
Upvotes: 2
Reputation: 3619
You can use nested DISTINCT to get the list of players and than count it.
players = load 'baseball' as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
pos = foreach players generate name, flatten(position) as position;
groupbyposition = group pos by position;
pos_count = foreach groupbyposition generate {
players = DISTINCT name;
generate group, COUNT(players) as num, pos;
}
Upvotes: 0