Reputation: 39913
I'd like to build random samples of an HBase table's rowkey space.
Say I'd like roughly 1% of the keys in an HBase table, drawn at random from across the whole table. What's the best way of doing this?
I suppose I could write a MapReduce job that processed all the data and pulled 1/100 of the keys... or perhaps use a coprocessor.
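To make the "pull 1/100 of the keys" idea concrete, here is a minimal sketch in plain Python of the per-key decision a mapper would make: keep each key independently with a fixed probability. The function name `sample_keys` and the generated keys are hypothetical stand-ins for a real full-table scan; nothing here touches HBase itself.

```python
import random
from typing import Iterable, Iterator, Optional


def sample_keys(keys: Iterable[bytes], fraction: float,
                seed: Optional[int] = None) -> Iterator[bytes]:
    """Yield each key independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    for key in keys:
        # rng.random() is uniform on [0, 1), so this keeps ~fraction of keys.
        if rng.random() < fraction:
            yield key


if __name__ == "__main__":
    # Hypothetical stand-in for scanning every row key in the table.
    fake_keys = (f"row-{i:08d}".encode() for i in range(100_000))
    sample = list(sample_keys(fake_keys, 0.01, seed=42))
    print(len(sample))  # close to 1,000, but not exactly
```

Because every key is decided independently, the sample size is only approximately 1% of the table, and the whole key space still has to be read once.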
Upvotes: 3
Views: 905
Reputation: 39913
I ended up doing this in Pig, but for whatever reason it was dreadfully slow. I got the data I needed, so I didn't go further, but I should probably try Alexander's answer.
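-- load column family A together with the row key (-loadKey true)
data = LOAD 'hbase://MARS1'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'A:*', '-loadKey true')
    AS (id:bytearray, A_map:map[]);
-- keep only the row key
justkeys = FOREACH data GENERATE id;
-- rough estimate of number of keys in hbase table
-- SAMPLE takes a fraction between 0 and 1; 0.01 would keep roughly 1% of the keys
smp = SAMPLE justkeys 0.000001;
STORE smp INTO 'key_sample' USING PigStorage('\t');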
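Pig's SAMPLE operator is probabilistic, so the fraction only fixes the expected number of keys returned, not the exact count. As a rough aid for picking that fraction, here is a small Python sketch of the arithmetic. The helper names and the row counts are hypothetical, and it assumes each row is kept independently with the given probability.

```python
import math
from typing import Tuple


def expected_sample_size(estimated_rows: int, fraction: float) -> Tuple[float, float]:
    """Return (mean, standard deviation) of the sample size under
    independent per-row sampling with probability `fraction`."""
    mean = estimated_rows * fraction
    std = math.sqrt(estimated_rows * fraction * (1.0 - fraction))
    return mean, std


def fraction_for_target(estimated_rows: int, target_keys: int) -> float:
    """Fraction to pass to SAMPLE to get about `target_keys` keys."""
    if estimated_rows <= 0:
        raise ValueError("estimated_rows must be positive")
    return min(1.0, target_keys / estimated_rows)


if __name__ == "__main__":
    # Hypothetical table size; substitute your own rough estimate.
    rows = 1_000_000_000
    print(expected_sample_size(rows, 0.000001))
    print(fraction_for_target(rows, 10_000_000))
```

With a fraction as small as 0.000001, a table needs on the order of a million rows before even one key is expected in the output.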
Upvotes: 0