Vikram
Vikram

Reputation: 837

Pig: Counting the occurence of a grouped column

In this raw data we have info of baseball players, the schema is:

name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]

Using the following script we are able to list out players and the different positions they have played. How do we get a count of how many players have played a particular position? E.G. How many players were in the 'Designated_hitter' position?

A single position can't appear multiple times in position bag for a player.

Pig Script and output for the sample data is listed below.

--pig script
players = load 'baseball' as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
pos     = foreach players generate name, flatten(position) as position;
groupbyposition = group pos by position;dump groupbyposition;

--dump groupbyposition (output of one position i.e Designated_hitter)
(Designated_hitter,{(Michael Young,Designated_hitter)})

Upvotes: 0

Views: 3391

Answers (2)

mr2ert
mr2ert

Reputation: 5186

From what I can tell you've already done all of the 'grunt' (Ha!, Pig joke) work. All there it left to do is use COUNT on the output of the GROUP BY. Something like:

groupbyposition = group pos by position ;
pos_count = FOREACH groupbyposition GENERATE group AS position, COUNT(pos) ;

Note: Using UDFs you may be able to get a more efficient solution. If you care about counting a certain few fields then it should be more efficient to filter the postion bag before hand (This is why I said UDF, I forgot you could just use a nested FILTER). For example:

pos = FOREACH players {
              -- you can also add the DISTINCT that alexeipab points out here
              -- make sure to change postion in the FILTER to dist!
              -- dist = DISTINCT position ;
              filt = FILTER postion BY p MATCHES 'Designated_hitter|etc.' ;
      GENERATE name, FLATTEN(filt) ;
}

If none of the positions you want appear in postion then it will create an empty bag. When empty bags are FLATTENed the row is discarded. This means you'll be FLATTENing bags of N or less elements (where N is the number of fields you want) instead of 7-15 (didn't really look at the data that closely), and the GROUP will be on significantly less data.

Notes: I'm not sure if this will be significantly faster (if at all). Also, using a UDF to preform the nested FILTER may be faster.

Upvotes: 2

alexeipab
alexeipab

Reputation: 3619

You can use nested DISTINCT to get the list of players and than count it.

players = load 'baseball' as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
pos     = foreach players generate name, flatten(position) as position;
groupbyposition = group pos by position;
pos_count = foreach groupbyposition generate {
     players = DISTINCT name;
     generate group, COUNT(players) as num, pos;
}

Upvotes: 0

Related Questions