Tnatsissa H Craeser

Reputation: 125

Basic statistics with Apache Pig

I am trying to compute, using Apache Pig, the fraction of rows that share each value of the first column.

For example, if the data looks like:

    a,15
    a,16
    a,17
    b,3
    b,16

I would like to get:

    a,0.6
    b,0.4

I am trying to do the following:

    A = LOAD 'my file' USING PigStorage(',');
    total = FOREACH (GROUP A ALL) GENERATE COUNT(A);

which gives me total = (5), but then when I attempt to use this 'total':

    fractions = FOREACH (GROUP A by $0) GENERATE COUNT(A)/total;

I get an error.

Clearly COUNT() returns some kind of projection and both projections (in computing total and fractions) should be consistent. Is there a way to make this work? Or perhaps just to cast total to be a number and avoid this projection consistency requirement?

Upvotes: 0

Views: 188

Answers (3)

Tnatsissa H Craeser

Reputation: 125

For some reason the following modification of what @inquisitive-mind suggested works:

    total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
    rows = FOREACH (GROUP A by $0) GENERATE group as colname, COUNT(A) as cnt;
    fractions = FOREACH rows GENERATE colname, cnt/(double)total.$0;
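
A likely explanation for why this works: total is a relation, not a number, so COUNT(A)/total has no meaning to Pig. Writing total.$0 reads the single value out of that one-row relation, and the (double) cast forces floating-point division. Here is a minimal end-to-end sketch of the same idea. The field names (label, score, n, fraction) are my own labels and not part of the original data, and the script assumes total contains exactly one row.

```pig
-- Field names (label, score, n, fraction) are illustrative assumptions.
A = LOAD 'my file' USING PigStorage(',') AS (label:chararray, score:int);

-- One-row relation holding the overall row count.
total = FOREACH (GROUP A ALL) GENERATE COUNT(A) AS n;

-- Per-label count divided by the overall count. total.n reads the single
-- value from the one-row relation; the casts avoid integer division.
fractions = FOREACH (GROUP A BY label)
            GENERATE group AS label,
                     (double)COUNT(A) / (double)total.n AS fraction;

DUMP fractions;
```

With the sample data from the question this should give 0.6 for a and 0.4 for b.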

Upvotes: 0

nobody

Reputation: 11080

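You have to project total to its single field and cast it to double. Inside the FOREACH over rows, refer to the fields positionally rather than through rows.$0, which Pig would treat as a scalar lookup on a multi-row relation:

    total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
    rows = FOREACH (GROUP A by $0) GENERATE group,COUNT(A);
    fractions = FOREACH rows GENERATE $0,(double)$1/(double)total.$0;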

Upvotes: 1

Keshav Pradeep Ramanath

Reputation: 1687

One more way to do the same:

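    test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray,two:int);
    -- count of rows per value of the first column
    B = GROUP test by $0;
    C = FOREACH B GENERATE group, COUNT(test.$0);
    -- total count of rows
    D = GROUP test ALL;
    E = FOREACH D GENERATE group,COUNT(test.$0);
    -- attach the total to every per-group row: (group, count, 'all', total)
    F = CROSS C,E;
    -- cast before dividing so the result is a fraction, not a truncated integer
    G = FOREACH F GENERATE $0,$1,$3,(double)$1/(double)$3;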

Output:
    (a,3,5,0.6)
    (b,2,5,0.4)

Upvotes: 1
