Reputation: 1421
I have records of type:
time | url
==========
34 google.com
42 cnn.com
54 yahoo.com
64 fb.com
I want to add another column to these records time_diff
which basically takes the difference of the time of the current record with the previous record. Output should look like:
time | url | time_diff
======================
34 google.com -- <can drop this row>
42 cnn.com 08
54 yahoo.com 12
64 fb.com 10
If I can somehow add another column (same as time
) shifting the time
by one such that 42 is aligned with 34, 54 is aligned with 42 and so on, then I can take the difference between these columns to calculate time_diff
column.
I can project the time
column to a new variable T
and if I can drop the first record in the original data, then I can join it with T
to obtain the desired result.
I appreciate any help. Thanks!
Upvotes: 1
Views: 460
Reputation: 415
I think you can use "lead" function of PiggyBank. Something like following might work.
A = LOAD 'T';
B = GROUP A ALL
C = FOREACH B {
C1 = ORDER A BY d;
GENERATE FLATTEN(Stitch(C1, Over(C1.time, 'lead')));
}
D = FOREACH C
GENERATE stitched::time AS time,
stitched::url AS url,
stitched::time - $3 AS time_diff;
https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/evaluation/Over.html
Upvotes: 0
Reputation: 5801
See this question, for example. You'll need to get your tuples in a bag (using GROUP ... ALL
in your case), and then in a nested FOREACH
, ORDER
them and call a UDF to rank them. After you have this rank, you can FLATTEN
the bag back out into a set of tuples again, and you'll have three fields: time
, url
, and rank
. Once you have this, create a fourth column which is rank-1
, do a self-join on those latter two columns, and you'll have what you need to compute the time_diff
.
Since multiple records can have the same time
, it would be a good idea to also sort on url
so that you are guaranteed the same result every time.
Upvotes: 3