Reputation: 59
I have some census data in which each line has a number denoting the county and fields for the number of people in a certain age range (eg, 5 and under, 5 to 17, etc.). After some initial processing in which I removed the unneeded columns, I grouped the labeled data as follows (labeled_data is of the schema {county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int}):
grouped_data = GROUP filtered_data BY county;
So grouped_data is of the schema
{group: chararray,filtered_data: {(county: chararray,pop1: int,pop2: int,pop3: int,pop4: int,pop5: int,pop6: int,pop7: int,pop8: int)}}
Now I would like to to sum up all of the pop fields for each county, yielding the total population of each county. I'm pretty sure the command to do this will be of the form
pop_sums = FOREACH grouped_data GENERATE group, SUM(something about the pop fields);
but I've been unable to get this to work. Thanks in advance!
I don't know if this is helpful, but the following is a representative entry of grouped_data:
(147,{(147,385,1005,283,468,649,738,933,977),(147,229,655,178,288,394,499,579,481)})
Note that the 147 entries are actually county codes, not populations. They are therefore of type chararray.
Upvotes: 1
Views: 111
Reputation: 4724
Can you try the below approach?
Sample input:
147,1,1,1,1,1,1,1,1
147,2,2,2,2,2,2,2,2
145,5,5,5,5,5,5,5,5
PigScript:
A = LOAD 'input' USING PigStorage(',') AS(country:chararray,pop1:int,pop2:int,pop3:int,pop4:int,pop5:int,pop6:int,pop7:int,pop8:int);
B = GROUP A BY country;
C = FOREACH B GENERATE group,(SUM(A.pop1)+SUM(A.pop2)+SUM(A.pop3)+SUM(A.pop4)+SUM(A.pop5)+SUM(A.pop6)+SUM(A.pop7)+SUM(A.pop8)) AS totalPopulation;
DUMP C;
Output:
(145,40)
(147,24)
Upvotes: 1