JaemyeongEo

Reputation: 403

Using a computed result in a filter on other data in Hadoop Pig

The title might be a little confusing, so I will show what I want to achieve.

Say I have a data set containing just integers:

10
20
30
40
50
60
70
80
90

The data is in a file called data.csv or something similar.

So I do

A = load 'data.csv' using PigStorage(',');

which loads it into A.

Then I use this data to calculate its average,

which I do with

B = foreach A generate int;
C = group B all;
avg = foreach C generate AVG(B.int);

(Ignore the little syntax errors, you get the point.)

So if I dump avg, I get a single value representing the average of the data in A.

Now what I want to do is

filter A, keeping only the values that are higher than the average.

So something like this:

X = filter A by int > avg

but Pig doesn't like me using a relation in a filter comparison.

How should I achieve this?

Upvotes: 0

Views: 669

Answers (1)

reo katoa

Reputation: 5801

Generate your original data along with the average and then filter:

A = load 'data.csv' using PigStorage(',');
B = foreach A generate int;
C = group B all;
D = foreach C generate FLATTEN(B.int), AVG(B.int) AS avg;
E = filter D by int > avg;

Relation D will be all of your original rows with the average appended as a second field.
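
For reference, here is a minimal sketch of the same idea with an explicit schema. It assumes data.csv holds one integer per line. The names val, avg_val, and the relation aliases are placeholders of mine, not from the question (int is a type keyword in Pig, so it is safer not to use it as a field name). The last two statements show a possible alternative, assuming your Pig version supports casting a single-row relation to a scalar.

```pig
-- Assumption: data.csv holds one integer per line; val and avg_val are placeholder names.
nums = LOAD 'data.csv' USING PigStorage(',') AS (val:int);

-- Put every row into a single group so AVG sees the whole data set.
grouped = GROUP nums ALL;

-- FLATTEN expands the bag back into one row per value, each paired with the average.
with_avg = FOREACH grouped GENERATE FLATTEN(nums.val) AS val, AVG(nums.val) AS avg_val;

-- Keep only the rows whose value is above the average.
above = FILTER with_avg BY val > avg_val;
DUMP above;

-- Alternative (assumes scalar projection is available): avg_rel has exactly one row,
-- so its field can be referenced like a scalar inside the filter on the original relation.
avg_rel = FOREACH grouped GENERATE AVG(nums.val) AS avg_val;
above_alt = FILTER nums BY val > (double)avg_rel.avg_val;
```

With the sample data from the question the average is 50, so only 60, 70, 80 and 90 should remain after the filter.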

Upvotes: 2
