Reputation: 1345
INPUT:
aaa T
aaa F
aaa F
bbb T
bbb T
ccc F
ccc F
OUTPUT
COL1 TOTAL_ROWS_IN_INPUT_TABLE PERCENTAGE_TRUE_IN_INPUT_TABLE
aaa 3 33%
bbb 2 100%
ccc 2 0%
How would I do this using PIG (LATIN)?
Upvotes: 1
Views: 4082
Reputation: 3619
In Pig 0.10 SUM(INPUT.col2) does not work and casting to boolean is not possible as it treats INPUT.col2 as a bag of boolean and bag is not a primitive type. Another thing is that if the input data for col2 is specified as boolean, than dump of the input does not have any values for the col2, but treating it as a chararray works just fine.
Pig is well suited for this type of tasks as it has means to work with individual groups by using operators nested in a FOREACH. Here is the solution which works:
inpt = load '....' as (col1 : chararray, col2 : chararray);
grp = group inpt by col1; -- creates bags for each value in col1
result = foreach grp {
total = COUNT(inpt);
t = filter inpt by col2 == 'T'; --create a bag which contains only T values
generate flatten(group) as col1, total as TOTAL_ROWS_IN_INPUT_TABLE, 100*(double)COUNT(t)/(double)total as PERCENTAGE_TRUE_IN_INPUT_TABLE;
};
dump result;
Output:
(aaa,3,33.333333333333336)
(bbb,2,100.0)
(ccc,2,0.0)
Upvotes: 3
Reputation: 5801
When you COUNT
the number of records for each key in col1
, you should count the number of true
values at the same time. This way the entire thing takes place in one MapReduce job.
grouped = group INPUT by col1
OUTPUT = foreach grouped generate group, COUNT(INPUT), (double)SUM(INPUT.col2)/COUNT(INPUT);
I am stuck with Pig 0.9 on a legacy system, so I am not familiar with the new boolean type. If it is possible to SUM
over booleans, then that should be sufficient. Otherwise, you will need to translate the booleans into 1s and 0s with a simple foreach
/generate
first.
Upvotes: 0