Lead and lag in Hbase

Question

I'm trying to figure out how to do the equivalent of Oracle's LEAD and LAG in Hbase or some other sort of pattern that will solve my problem. I could write a MapReduce program that does this quite easily, but I'd love to be able to exploit the fact that the data is already sorted in the way I need it to be.

My problem is as follows: I have a rowkey and a value that looks like:

(employee name + timestamp) => data:salary

So, some example data might be:

miller, bob;2010-01-14 => data:salary=90000
miller, bob;2010-11-04 => data:salary=102000
miller, bob;2011-12-03 => data:salary=107000
monty, fred;2010-04-10 => data:salary=19000
monty, fred;2011-09-09 => data:salary=24000

What I want to do is calculate the changes of salary, record by record. I want to transform the above data into differences between records:

miller, bob;2010-01-14 => data:salarydiff=90000
miller, bob;2010-11-04 => data:salarydiff=12000
miller, bob;2011-12-03 => data:salarydiff=5000
monty, fred;2010-04-10 => data:salarydiff=19000
monty, fred;2011-09-09 => data:salarydiff=5000

I'm up for changing the rowkey strategy if necessary.

Arnon Rotem-Gal-Oz · Accepted Answer

What I'd do is change the key so that the timestamp will be descending (newer salary first)

miller, bob;2011-12-03 => data:salary=107000
miller, bob;2010-11-04 => data:salary=102000
miller, bob;2010-01-14 => data:salary=90000

Now you can do a simple map job that will scan the table. Then in the map you create a new Scan to the current key. Scan.next to get the previous salary, calculate the diff and store it in a new column on the current row key
Basically in your mapper class (the one that inherits TableMapper) you override the setup method and get the configuration

@Override
protected void setup(Mapper.Context context) throws IOException,InterruptedException {
    Configuration config = context.getConfiguration();
    table = new HTable(config,);
}
Then inside the map you extract the row key from the row parmeter, create the new Scan and continue as explained above
In most cases the next record would be in the same region - occasionally it might go to another regionserver

Lead and lag in Hbase

Answers (1)

Related Questions