Selecting random tuple from bag

Question

Is it possible to (efficiently) select a random tuple from a bag in pig? I can just take the first result of a bag (as it is unordered), but in my case I need a proper random selection. One (not efficient) solution is counting the number of tuples in the bag, take a random number within that range, loop through the bag, and stop whenever the number of iterations matches my random number. Does anyone know of faster/better ways to do this?

alexeipab · Accepted Answer

You could use RANDOM(), ORDER and LIMIT in a nested FOREACH statement to select one element with the smallest random number:

inpt = load 'group.txt' as (id:int, c1:bytearray, c2:bytearray);
groups = group inpt by id;
randoms = foreach groups {
    rnds = foreach inpt generate *, RANDOM() as rnd; -- assign random number to each row in the bag
    ordered_rnds = order rnds by rnd;
    one_tuple = limit ordered_rnds 1; -- select tuple with the smallest random number
    generate group as id, one_tuple;
};

dump randoms;

INPUT:

OUTPUT:

(1,{(1,b,r,0.05172709255901231)})
(2,{(2,k,k,0.14351660053632986)})
(3,{(3,e,l,0.0854104195792681)})
(4,{(4,h,b,8.906013598960483E-4)})
(5,{(5,h,k,0.6219490873384448)})

If you run "dump randoms;" multiple times, you should get different results for each run.

Writing a UDF might give you better performance as you do not need to do secondary sort on random within the bag.

Selecting random tuple from bag

Answers (2)

Related Questions