gaurav jain
gaurav jain

Reputation: 1345

Calculating percentage in a pig query

INPUT:

aaa T
aaa F
aaa F
bbb T
bbb T
ccc F
ccc F

OUTPUT

COL1   TOTAL_ROWS_IN_INPUT_TABLE   PERCENTAGE_TRUE_IN_INPUT_TABLE
aaa     3                          33%
bbb     2                          100%
ccc     2                          0%

How would I do this using PIG (LATIN)?

Upvotes: 1

Views: 4082

Answers (2)

alexeipab
alexeipab

Reputation: 3619

In Pig 0.10 SUM(INPUT.col2) does not work and casting to boolean is not possible as it treats INPUT.col2 as a bag of boolean and bag is not a primitive type. Another thing is that if the input data for col2 is specified as boolean, than dump of the input does not have any values for the col2, but treating it as a chararray works just fine.

Pig is well suited for this type of tasks as it has means to work with individual groups by using operators nested in a FOREACH. Here is the solution which works:

inpt = load '....' as (col1 : chararray, col2 : chararray);
grp = group inpt by col1; -- creates bags for each value in col1
result = foreach grp {
    total = COUNT(inpt);
    t = filter inpt by col2 == 'T'; --create a bag which contains only T values
    generate flatten(group) as col1, total as  TOTAL_ROWS_IN_INPUT_TABLE, 100*(double)COUNT(t)/(double)total as PERCENTAGE_TRUE_IN_INPUT_TABLE;
};

dump result;

Output:

(aaa,3,33.333333333333336)
(bbb,2,100.0)
(ccc,2,0.0)

Upvotes: 3

reo katoa
reo katoa

Reputation: 5801

When you COUNT the number of records for each key in col1, you should count the number of true values at the same time. This way the entire thing takes place in one MapReduce job.

grouped = group INPUT by col1
OUTPUT = foreach grouped generate group, COUNT(INPUT), (double)SUM(INPUT.col2)/COUNT(INPUT);

I am stuck with Pig 0.9 on a legacy system, so I am not familiar with the new boolean type. If it is possible to SUM over booleans, then that should be sufficient. Otherwise, you will need to translate the booleans into 1s and 0s with a simple foreach/generate first.

Upvotes: 0

Related Questions