Apache Pig: Join records by shifting

Question

I have records of type:

time | url
==========
34 google.com 
42 cnn.com
54 yahoo.com
64 fb.com

I want to add another column to these records time_diff which basically takes the difference of the time of the current record with the previous record. Output should look like:

time | url | time_diff
======================
34 google.com -- 
42 cnn.com    08
54 yahoo.com  12
64 fb.com     10

If I can somehow add another column (same as time) shifting the time by one such that 42 is aligned with 34, 54 is aligned with 42 and so on, then I can take the difference between these columns to calculate time_diff column.

I can project the time column to a new variable T and if I can drop the first record in the original data, then I can join it with T to obtain the desired result.

I appreciate any help. Thanks!

reo katoa · Accepted Answer

See this question, for example. You'll need to get your tuples in a bag (using GROUP ... ALL in your case), and then in a nested FOREACH, ORDER them and call a UDF to rank them. After you have this rank, you can FLATTEN the bag back out into a set of tuples again, and you'll have three fields: time, url, and rank. Once you have this, create a fourth column which is rank-1, do a self-join on those latter two columns, and you'll have what you need to compute the time_diff.

Since multiple records can have the same time, it would be a good idea to also sort on url so that you are guaranteed the same result every time.

Apache Pig: Join records by shifting

Answers (2)

Related Questions