Reputation: 47
I have one table loaded twice to perform a self join called current and previous. Both contain columns "key" (not unique) and "value". I have grouped by key, and counted the number of values in each group of keys.
I would like to find how many more values were added to the current table compared to the previous table, but I get the error "Invalid scalar projection: cur_count : A column needs to be projected from a relation for it to be used as a scalar". I am relatively new to pig latin, so I'm unsure of what the syntax should be for performing this difference.
Please disregard syntax for the cur_count and prev_count.
cur_count = FOREACH cur_grouped GENERATE COUNT(current);
prev_count = FOREACH prev_grouped GENERATE COUNT(previous);
left_join = join current by key LEFT OUTER, previous by key-1;
difference = FOREACH left_join GENERATE key, cur_count-prev_count; //error here
dump difference;
Below are some sample data
key value
1 12
1 34
1 11
1 45
2 4
3 34
3 34
3 23
4 15
4 19
What my script does so far: it counts the number of values in each group of keys
key count
1 4
2 1
3 3
4 2
I would like to find the difference in number of values between a key and the previous key
key difference
2 -3
3 2
4 -1
Upvotes: 0
Views: 630
Reputation: 1121
Presume you have two groups grp1 and grp2 with the content you described earlier
key count
1 4
2 1
3 3
4 2
Note: I have not executed below Pig statements.
-- Generate the Ranks for two relations
grp1 = rank grp1;
grp2 = rank grp2;
-- Increment rank by 1 for each record in grp2
grp2 = foreach grp2 generate ($0+1) as rank,key,count
After these the two relations would look like below. Arranged them side by side for comparison.
Group 1 Group 2
Rank key count Rank key count
1 1 4 2 1 4
2 2 1 3 2 1
3 3 3 4 3 3
4 4 2 5 4 2
Join the two groups by RANK which would yield below output
Rank key count Rank key count
2 2 1 2 1 4
3 3 3 3 2 1
4 4 2 4 3 3
5 4 2
Now you can run another "foreach" statement that finds the difference in two count columns above.
result = FOREACH <<joined relation>> GENERATE $1 as key,($2-$5) as difference
Upvotes: 1
Reputation: 11090
cur_count and prev_count are relations and cannot be used the way you are using.You can achieve the desired output using the script below.After joining the relations with (key-1),use the columns from the relation to get the difference.
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int);
B = GROUP A BY f1;
C = FOREACH B GENERATE group,COUNT(A);
D = FOREACH B GENERATE group,COUNT(A);
E = JOIN C BY $0,D BY ($0-1);
F = FOREACH E GENERATE $2,$3-$1;
DUMP F;
Upvotes: 1