Reputation: 403
The title might be a little confusing, so I will show what I want to achieve.
Say I have a data file that contains only integers:
10
20
30
40
50
60
70
80
90
The file is called data.csv or something similar.
So I do
A = load 'data.csv' using PigStorage(',');
which loads it into A.
Then I use this data to calculate its average,
which I do with
B = foreach A generate int;
C = group B all;
avg = foreach C generate AVG(B.int);
(Ignore the little syntax errors, you get the point.)
If I dump avg, I get a single number representing the average of the data in A.
Now what I want to do is
filter A so that it keeps only the values that are higher than the average.
Something like this:
X = filter A by int > avg
But Pig doesn't like me using a relation alias in a filter comparison.
How should I achieve this?
Upvotes: 0
Views: 669
Reputation: 5801
Generate your original data along with the average and then filter:
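-- `int` is a reserved type keyword in Pig, so the field is named val here
A = load 'data.csv' using PigStorage(',') as (val:int);
B = foreach A generate val;
C = group B all;
-- FLATTEN emits one row per original value; AVG is repeated on every row
D = foreach C generate FLATTEN(B.val) as val, AVG(B.val) as avg;
E = filter D by val > avg;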
Relation D will contain all of your original rows, with the average appended as a second field.
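As an alternative sketch, Pig 0.8 and later let you read a field of a single-row relation as a scalar, which is close to what the question was trying to write. This assumes data.csv has one integer per line; the names `val`, `avg_rel`, and `avg_val` are made up for illustration and are not from the original post.

```pig
-- Assumption: one integer per line; `val` is a hypothetical field name.
A = LOAD 'data.csv' USING PigStorage(',') AS (val:int);
C = GROUP A ALL;
-- avg_rel holds exactly one row with one field, the average as a double
avg_rel = FOREACH C GENERATE AVG(A.val) AS avg_val;
-- Because avg_rel has a single row, avg_rel.avg_val can be used as a scalar
X = FILTER A BY val > avg_rel.avg_val;
DUMP X;
```

Unlike relation D, X keeps only the original column. The scalar form fails at runtime if the referenced relation has more than one row, so it only fits aggregates computed over a `GROUP ... ALL`.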
Upvotes: 2