Reputation: 1687
I have the below requirements.
Test data has the following values.
I need to find the percentage of each of the characters out of the total.
I have tried with the below query , but not to success.
Ex:
W
H
U
U
H
W
U
W
W
H
W
U
H
H
H
U
W
W
W
H
data = LOAD 'location of test data';
grp = GROUP data BY data.$0; // considering only 1 field in this csv.
result = FOREACH grp GENERATE group, COUNT(data.$0)/SUM(data.$0);
Since the fields are chararrays
, I am not able to do the sum of the fields.
Is there an alternate to find one?
If I use a GROUP ALL, followed by COUNT(data.$0), I get the total number of entries.
If I use a GROUP of the field, followed by COUNT(data.$0), I get the individual count.
Here i need the percentage of this individual count by the sum.
Thanks in advance.
Upvotes: 1
Views: 423
Reputation: 1121
Here i need the percentage of this individual count by the sum.
To do this, you would need to run two Pig Operations I believe - 1) First as you said get the individual counts in one relation
W 8
H 7
U 5
2) Second, you count all the elements as you mentioned earlier in one relation
total 20
3) You then need to CROSS the relations obtained in first and two (CROSS) so that you have a new relation like this
W 8 20
H 7 20
U 5 20
4) Post this, you can calculate the percentage that you wanted.
Update
Below is the Pig script that I came up with.
A = LOAD 'data.txt' using PigStorage('\n');
--DUMP A;
B = GROUP A by $0;
C = FOREACH B GENERATE group, COUNT(A.$0);
--DUMP C;
D = GROUP A ALL;
E = FOREACH D GENERATE group,COUNT(A.$0);
DUMP E;
DESCRIBE C;
DESCRIBE E;
F = CROSS C,E;
G = FOREACH F GENERATE $0,$1,$3,($1*100/$3);
DESCRIBE G;
DUMP G;
Upvotes: 1
Reputation: 19375
you have to do that manually,
something like
data = foreach data generate *, ((B=='b1')?1:0) AS dummy_b1;
data = foreach data generate *, mean(dummy_b1) AS percentage;
Upvotes: 0