mgurov

Reputation: 63

HBase scan operation caching

What is the difference between setCaching and setBatch in the HBase scan mechanism? Which one should I use for the best performance when scanning large volumes of data?

Upvotes: 1

Views: 5133

Answers (2)

Saurabh

Reputation: 7833

setCaching specifies how many rows the scanner cache is filled with before the Scan result is returned. By default, the caching setting on the table is used. The goal is to balance IO and network load.

public Scan setCaching(int caching)

If your table has very wide rows (rows with a large number of columns), use setBatch(int batch) to limit how many columns come back at once, setting it to the number of columns you want returned in one batch. Note that a large number of columns per row is not a recommended design pattern.

public Scan setBatch(int batch)

This is a nice link: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/admin_hbase_scanning.html
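To make the batching idea concrete, here is a minimal sketch in plain Java that does not use the HBase client at all. The class BatchSketch and the method splitRow are hypothetical names chosen for illustration; they only mimic how a wide row would be handed back in chunks of at most `batch` columns, which is how I understand setBatch to behave.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSketch {
    // Hypothetical helper: splits one row's column names into chunks of at most
    // `batch` entries, mimicking a wide row being returned in several pieces.
    static List<List<String>> splitRow(List<String> columns, int batch) {
        if (batch <= 0) {
            throw new IllegalArgumentException("batch must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < columns.size(); i += batch) {
            int end = Math.min(i + batch, columns.size());
            chunks.add(new ArrayList<>(columns.subList(i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // A made-up row with 10 columns, read with a batch size of 4.
        List<String> row = new ArrayList<>();
        for (int c = 0; c < 10; c++) {
            row.add("col" + c);
        }
        for (List<String> chunk : splitRow(row, 4)) {
            System.out.println(chunk);
        }
    }
}
```

With 10 columns and a batch of 4, the row comes back as three chunks of 4, 4, and 2 columns, so the client never has to hold the whole row in a single result.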

Upvotes: 1

Rubén Moraleda

Reputation: 3067

Unless you have super-wide tables with many columns (or very large ones), you should forget about setBatch() entirely and focus exclusively on setCaching():


setCaching(int caching)

Set the number of rows for caching that will be passed to scanners. If not set, the Configuration setting HConstants.HBASE_CLIENT_SCANNER_CACHING will apply. Higher caching values will enable faster scanners but will use more memory.

setBatch(int batch)

Set the maximum number of values to return for each call to next()


setBatch controls how many values of a row are returned on each call/iteration. Here's a nice post about it: http://blog.jdwyah.com/2013/08/hbase-scan-batch-vs-cache.html
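As a rough way to reason about how the two settings interact, the following plain-Java sketch estimates how many data-carrying fetches a scan needs. It is a simplified model, not HBase code: the class ScanRpcEstimate and its methods are hypothetical, it assumes every row has the same number of columns, it assumes caching counts returned results (so a row split by the batch setting counts once per piece), and it ignores scanner open/close calls and any size-based limits.

```java
public class ScanRpcEstimate {
    // Results produced per row: a row with `columns` values is split into pieces
    // of at most `batch` values. batch <= 0 is treated here as "no batching".
    static long resultsPerRow(int columns, int batch) {
        if (batch <= 0 || batch >= columns) {
            return 1;
        }
        return (columns + batch - 1) / batch; // ceiling division
    }

    // Estimated data-carrying fetches: each fetch carries up to `caching` results.
    static long estimateFetches(long rows, int columns, int batch, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        long results = rows * resultsPerRow(columns, batch);
        return (results + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        // Made-up table: 10 rows with 20 columns each.
        System.out.println(estimateFetches(10, 20, 1, 1));    // caching=1,   batch=1
        System.out.println(estimateFetches(10, 20, 1, 200));  // caching=200, batch=1
        System.out.println(estimateFetches(10, 20, 100, 5));  // caching=5,   batch=100
    }
}
```

Under these assumptions, a small batch multiplies the number of results the scanner has to ship, while a larger caching value divides the number of round trips, which is why caching is usually the setting to tune first for narrow rows.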

Upvotes: 3
