chardex

Reputation:

HBase distributed scanner

In the "API usage example" section of the "Getting started" page of the HBase documentation, there is an example of scanner usage:

Scanner scanner = table.getScanner(new String[]{"myColumnFamily:columnQualifier1"});

RowResult rowResult = scanner.next();
while (rowResult != null) {
    // ...
    rowResult = scanner.next();
}

As I understand it, this code will be executed on one machine (the name node), so the scanning and filtering work will not be distributed; only data storage and data loading will be distributed. How can I use a distributed scanner that works separately on each node?

What is the best practice for fast data filtering? Thanks.

Upvotes: 3

Views: 1696

Answers (2)

mibrahim

Reputation: 67

The scanner works by starting on the first region, scanning its rows, and hopping from one region to the next. A trick you can use is to create multiple scanners, each starting and ending on the start and end keys of one region, then create multiple threads that read from all of them in parallel and write to a single output queue. Your process then needs to be fast enough at reading, processing, and removing items from that queue; otherwise you might OOM the client if too many rows come in too fast. You will also need to use concurrent data structures to avoid synchronization delays.

You can retrieve the region information using getRegionLocations on an HTable: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocations()

Also keep in mind that scanners can time out if you don't read from them fast enough, so blocking your consumer threads until your queue becomes empty might not always be an option.
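A minimal sketch of this approach, assuming the classic HTable/Scan client API of that era; the table name "myTable", the column names, the queue size, and the thread count are placeholders you would replace with your own:

import java.util.concurrent.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelRegionScan {
    public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        // Bounded queue: producers block when it is full instead of piling up rows in memory.
        final BlockingQueue<Result> queue = new LinkedBlockingQueue<Result>(10000);

        HTable meta = new HTable(conf, "myTable");
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            // One scan per region, bounded by that region's start and end keys.
            for (final HRegionInfo region : meta.getRegionLocations().keySet()) {
                pool.submit(new Callable<Void>() {
                    public Void call() throws Exception {
                        // HTable is not thread-safe, so each worker opens its own instance.
                        HTable table = new HTable(conf, "myTable");
                        try {
                            Scan scan = new Scan(region.getStartKey(), region.getEndKey());
                            scan.addColumn(Bytes.toBytes("myColumnFamily"),
                                           Bytes.toBytes("columnQualifier1"));
                            ResultScanner scanner = table.getScanner(scan);
                            try {
                                for (Result r : scanner) {
                                    queue.put(r); // blocks while the queue is full
                                }
                            } finally {
                                scanner.close();
                            }
                        } finally {
                            table.close();
                        }
                        return null;
                    }
                });
            }
            // A consumer thread would drain `queue` here, fast enough that the
            // region scanners do not time out while producers wait on a full queue.
        } finally {
            pool.shutdown();
            meta.close();
        }
    }
}

The bounded queue and per-thread HTable instances reflect the caveats above: the consumer has to keep up, and the client tables cannot be shared between threads.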

Upvotes: 1

Tobu

Reputation: 25416

This is old, but anyway: the scanner is just a cursor-like API for retrieving computed results. For computation, you use MapReduce jobs (hbase.mapred).
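For illustration, here is a minimal sketch of such a job using the org.apache.hadoop.hbase.mapreduce package (the newer replacement for hbase.mapred); the table and column names are placeholders, and the mapper simply emits the keys of rows that contain the column:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FilterJob {

    // Each mapper handles the rows of (roughly) one region, so the scanning and
    // filtering run on the cluster instead of on a single client machine.
    static class FilterMapper extends TableMapper<Text, IntWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] cell = value.getValue(Bytes.toBytes("myColumnFamily"),
                                         Bytes.toBytes("columnQualifier1"));
            if (cell != null) {
                context.write(new Text(Bytes.toString(row.get())), new IntWritable(1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-filter");
        job.setJarByClass(FilterJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);       // fetch larger batches per RPC
        scan.setCacheBlocks(false); // recommended for full-table MR scans

        TableMapReduceUtil.initTableMapperJob(
                "myTable", scan, FilterMapper.class, Text.class, IntWritable.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class); // this sketch discards output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the table input format creates one input split per region, this distributes the scan and filter work across the cluster, which is what the question is asking for.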

Upvotes: 1
