NPE
NPE

Reputation: 1421

Apache Pig: Join records by shifting

I have records of type:

time | url
==========
34 google.com 
42 cnn.com
54 yahoo.com
64 fb.com

I want to add another column to these records time_diff which basically takes the difference of the time of the current record with the previous record. Output should look like:

time | url | time_diff
======================
34 google.com -- <can drop this row>
42 cnn.com    08
54 yahoo.com  12
64 fb.com     10

If I can somehow add another column (same as time) shifting the time by one such that 42 is aligned with 34, 54 is aligned with 42 and so on, then I can take the difference between these columns to calculate time_diff column.

I can project the time column to a new variable T and if I can drop the first record in the original data, then I can join it with T to obtain the desired result.

I appreciate any help. Thanks!

Upvotes: 1

Views: 460

Answers (2)

Gubi
Gubi

Reputation: 415

I think you can use "lead" function of PiggyBank. Something like following might work.

 A = LOAD 'T';

 B = GROUP A ALL

 C = FOREACH B {
     C1 = ORDER A BY d;
     GENERATE FLATTEN(Stitch(C1, Over(C1.time, 'lead')));
 }

 D = FOREACH C
    GENERATE stitched::time AS time,
            stitched::url AS url,
            stitched::time - $3 AS time_diff;

https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/evaluation/Over.html

Upvotes: 0

reo katoa
reo katoa

Reputation: 5801

See this question, for example. You'll need to get your tuples in a bag (using GROUP ... ALL in your case), and then in a nested FOREACH, ORDER them and call a UDF to rank them. After you have this rank, you can FLATTEN the bag back out into a set of tuples again, and you'll have three fields: time, url, and rank. Once you have this, create a fourth column which is rank-1, do a self-join on those latter two columns, and you'll have what you need to compute the time_diff.

Since multiple records can have the same time, it would be a good idea to also sort on url so that you are guaranteed the same result every time.

Upvotes: 3

Related Questions