Shx
Shx

Reputation: 47

Pig Latin Count difference between two tables

I have one table loaded twice to perform a self join called current and previous. Both contain columns "key" (not unique) and "value". I have grouped by key, and counted the number of values in each group of keys.

I would like to find how many more values were added to the current table compared to the previous table, but I get the error "Invalid scalar projection: cur_count : A column needs to be projected from a relation for it to be used as a scalar". I am relatively new to pig latin, so I'm unsure of what the syntax should be for performing this difference.

Please disregard syntax for the cur_count and prev_count.

cur_count = FOREACH cur_grouped GENERATE COUNT(current);
prev_count = FOREACH prev_grouped GENERATE COUNT(previous);

left_join = join current by key LEFT OUTER, previous by key-1;
difference = FOREACH left_join GENERATE key, cur_count-prev_count; //error here
dump difference;

Below are some sample data

key     value
1         12
1         34
1         11
1         45
2          4
3         34
3         34
3         23
4         15
4         19

What my script does so far: it counts the number of values in each group of keys

key     count
1         4
2         1
3         3
4         2

I would like to find the difference in number of values between a key and the previous key

key     difference 
 2          -3
 3           2
 4          -1  

Upvotes: 0

Views: 630

Answers (2)

Amit
Amit

Reputation: 1121

Presume you have two groups grp1 and grp2 with the content you described earlier

key     count
1         4
2         1
3         3
4         2

Note: I have not executed below Pig statements.

-- Generate the Ranks for two relations
grp1 = rank grp1;
grp2 = rank grp2;
-- Increment rank by 1 for each record in grp2
grp2 = foreach grp2 generate ($0+1) as rank,key,count

After these the two relations would look like below. Arranged them side by side for comparison.

Group 1                  Group 2
Rank key     count    Rank   key     count
1     1         4       2     1         4
2     2         1       3     2         1
3     3         3       4     3         3  
4     4         2       5     4         2  

Join the two groups by RANK which would yield below output

Rank  key     count    Rank   key     count
2      2         1      2     1         4
3      3         3      3     2         1
4      4         2      4     3         3  
                        5     4         2 

Now you can run another "foreach" statement that finds the difference in two count columns above.

result = FOREACH <<joined relation>> GENERATE $1 as key,($2-$5) as difference

Upvotes: 1

nobody
nobody

Reputation: 11090

cur_count and prev_count are relations and cannot be used the way you are using.You can achieve the desired output using the script below.After joining the relations with (key-1),use the columns from the relation to get the difference.

A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int);
B = GROUP A BY f1;
C = FOREACH B GENERATE group,COUNT(A);
D = FOREACH B GENERATE group,COUNT(A);
E = JOIN C BY $0,D BY ($0-1);
F = FOREACH E GENERATE $2,$3-$1;
DUMP F;

Output

Upvotes: 1

Related Questions