Reputation: 125
I am trying to characterize fractions of rows having certain properties using Apache Pig.
For example, if the data looks like:
a,15
a,16
a,17
b,3
b,16
I would like to get:
a,0.6
b,0.4
I am trying to do the following:
A = LOAD 'my file' USING PigStorage(',');
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
which gives me total = (5), but then when I attempt to use this 'total':
fractions = FOREACH (GROUP A by $0) GENERATE COUNT(A)/total;
I get an error.
Clearly COUNT() returns some kind of projection and both projections (in computing total and fractions) should be consistent. Is there a way to make this work? Or perhaps just to cast total to be a number and avoid this projection consistency requirement?
Upvotes: 0
Views: 188
Reputation: 125
For some reason the following modification of what @inquisitive-mind suggested works:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group as colname, COUNT(A) as cnt;
fractions = FOREACH rows GENERATE colname, cnt/(double)total.$0;
Upvotes: 0
Reputation: 11080
You will have to project and cast it to double:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group,COUNT(A);
fractions = FOREACH rows GENERATE rows.$0,(double)rows.$1/(double)total.$0;
Upvotes: 1
Reputation: 1687
One more way to do the same:
test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray,two:int);
B = GROUP test by $0;
C = FOREACH B GENERATE group, COUNT(test.$0);
D = GROUP test ALL;
E = FOREACH D GENERATE group,COUNT(test.$0);
F = CROSS C,E;
G = FOREACH F GENERATE $0,$1,$3,(double)($1*100/$3);
Output:
(a,3,5,0.6)
(b,2,5,0.4)
Upvotes: 1