Donald Miner

Reputation: 39913

Sampling HBase table keyspace

I'd like to build random samples of an HBase table's rowkey space.

Say I'd like roughly 1% of the table's keys, randomly distributed across the table. What's the best way of doing this?

I suppose I could write a MapReduce job that processed all the data and pulled 1/100 of the keys... or perhaps use a coprocessor.

Upvotes: 3

Views: 905

Answers (2)

Donald Miner

Reputation: 39913

I ended up doing this in Pig, but for whatever reason it was dreadfully slow. I got the data I needed, so I didn't go further, but I should probably try Alexander's answer.

data = LOAD 'hbase://MARS1'
   USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
     'A:*', '-loadKey true')
   AS (id:bytearray, A_map:map[]);

justkeys = FOREACH data GENERATE id;

-- sampling fraction, picked from a rough estimate of the number of keys in the HBase table
smp = SAMPLE justkeys 0.000001;

STORE smp INTO 'key_sample' USING PigStorage('\t');

Upvotes: 0

Alexander Kuznetsov

Reputation: 3122

You can use the RandomRowFilter to get the sample.
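
As far as I know, RandomRowFilter takes a single float `chance` and keeps each row independently with that probability, so `new RandomRowFilter(0.01f)` set on a `Scan` via `setFilter` should return roughly 1% of the rows. The sketch below is not HBase client code: it is a stdlib-only illustration of that per-row decision, run over an in-memory list of made-up keys. The class name `RowKeySampler`, the `sample` method, and the `row-NNNNNN` keys are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for a table scan: illustrates Bernoulli sampling
// of row keys, the kind of per-row coin flip a random row filter performs.
final class RowKeySampler {

    // Keeps each key independently with probability `chance`.
    // The result preserves the input (scan) order.
    public static List<String> sample(Iterable<String> rowKeys, float chance, Random rng) {
        List<String> kept = new ArrayList<>();
        for (String key : rowKeys) {
            // nextFloat() is uniform in [0, 1), so this is true with probability `chance`
            if (rng.nextFloat() < chance) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up keys standing in for the table's rowkey space.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            keys.add(String.format("row-%06d", i));
        }
        List<String> picked = sample(keys, 0.01f, new Random());
        System.out.println("kept " + picked.size() + " of " + keys.size() + " keys");
    }
}
```

The sample size is only approximately 1%, since every row is decided independently. Against a real table, the filtering happens on the region servers, so only the sampled rows travel back to the client. If just the keys are needed, combining the filter with a KeyOnlyFilter in a FilterList should avoid shipping cell values as well.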

Upvotes: 3
