BobbyPin
BobbyPin

Reputation: 59

Pig: Summing Fields

I have some census data in which each line has a number denoting the county and fields for the number of people in a certain age range (eg, 5 and under, 5 to 17, etc.). After some initial processing in which I removed the unneeded columns, I grouped the labeled data as follows (labeled_data is of the schema {county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int}):

grouped_data = GROUP filtered_data BY county;

So grouped_data is of the schema

{group: chararray,filtered_data: {(county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int)}}

Now I would like to to sum up all of the pop fields for each county, yielding the total population of each county. I'm pretty sure the command to do this will be of the form

pop_sums = FOREACH grouped_data GENERATE group, SUM(something about the pop fields);

but I've been unable to get this to work. Thanks in advance!

I don't know if this is helpful, but the following is a representative entry of grouped_data:

(147,{(147,385,1005,283,468,649,738,933,977),(147,229,655,178,288,394,499,579,481)})

Note that the 147 entries are actually county codes, not populations. They are therefore of type chararray.

Upvotes: 1

Views: 111

Answers (1)

Sivasakthi Jayaraman
Sivasakthi Jayaraman

Reputation: 4724

Can you try the below approach?

Sample input:

147,1,1,1,1,1,1,1,1
147,2,2,2,2,2,2,2,2
145,5,5,5,5,5,5,5,5

PigScript:

A = LOAD 'input' USING PigStorage(',') AS(country:chararray,pop1:int,pop2:int,pop3:int,pop4:int,pop5:int,pop6:int,pop7:int,pop8:int);
B = GROUP A BY country;
C = FOREACH B GENERATE group,(SUM(A.pop1)+SUM(A.pop2)+SUM(A.pop3)+SUM(A.pop4)+SUM(A.pop5)+SUM(A.pop6)+SUM(A.pop7)+SUM(A.pop8)) AS totalPopulation;
DUMP C;

Output:

(145,40)
(147,24)

Upvotes: 1

Related Questions