David Greenshtein

Reputation: 538

HBase pagination using Java API wrong behaviour

I'm using HBase 0.98.4.2.2.0.0-2041-hadoop2 running on 9 nodes. My table is distributed across 12 regions and contains about 113M records.

I'm running a paginated query using:

Filter pageFilter = new PageFilter(pageSize);
Scan scan = new Scan();
RegexStringComparator comp = new RegexStringComparator("._1");
RowFilter rowFilter = new RowFilter(CompareOp.EQUAL, comp);
FilterList filterList = new FilterList(Operator.MUST_PASS_ALL, pageFilter, rowFilter);
scan.setFilter(filterList);

My page size is 100K. On page 30 the query returns 0 results, so I get only 3M results in total, but when I run the same query in the HBase shell I get 14M.

Here is the HBase shell query:

scan 'mgr', {COLUMNS => 'mtf:f',FILTER => org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),RegexStringComparator.new("._1"))}

Why does my paginated Java query return fewer results than the HBase shell query? Am I missing some configuration on the client side?

Thanks.

Upvotes: 0

Views: 895

Answers (1)

d4rxh4wx

Reputation: 91

PageFilter is not stateless. If it is declared first in an AND operation (i.e. PAGE_FILTER AND MY_FILTER), it can produce wrong results, because the PageFilter counter is incremented even when the row does not match your other filter. You have to declare it as the second filter (i.e. MY_FILTER AND PAGE_FILTER), so that page filtering is applied only to rows that pass your first filter. In your code, that means passing rowFilter before pageFilter to the FilterList constructor.
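To make the ordering effect concrete, here is a small plain-Java toy model of the behaviour described above. It is not HBase code and does not reproduce HBase's actual filter implementation: `RowCheck`, `PageLimit`, `KeyRegex` and `scan` are hypothetical names invented for this sketch, and the row keys are made up. It only assumes that checks in an AND chain are evaluated in order and that the page counter is bumped every time it is consulted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FilterOrderDemo {

    /** Hypothetical stand-in for a filter: returns true if the row should be kept. */
    interface RowCheck {
        boolean accept(String rowKey);
    }

    /** Stateful page limit: counts every row it is asked about. */
    static class PageLimit implements RowCheck {
        private final int pageSize;
        private int seen = 0;

        PageLimit(int pageSize) {
            this.pageSize = pageSize;
        }

        public boolean accept(String rowKey) {
            seen++; // incremented whether or not the other checks accept the row
            return seen <= pageSize;
        }
    }

    /** Stateless check: keeps rows whose key contains a match for the regex. */
    static class KeyRegex implements RowCheck {
        private final Pattern pattern;

        KeyRegex(String regex) {
            this.pattern = Pattern.compile(regex);
        }

        public boolean accept(String rowKey) {
            return pattern.matcher(rowKey).find();
        }
    }

    /** Applies the checks in order and stops at the first one that rejects the row. */
    static List<String> scan(List<String> rowKeys, RowCheck... checks) {
        List<String> kept = new ArrayList<>();
        for (String key : rowKeys) {
            boolean pass = true;
            for (RowCheck check : checks) {
                if (!check.accept(key)) {
                    pass = false;
                    break;
                }
            }
            if (pass) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a_0", "b_1", "c_0", "d_1", "e_0", "f_1");

        // Page limit first: non-matching rows use up the page budget.
        System.out.println(scan(keys, new PageLimit(2), new KeyRegex("._1"))); // [b_1]

        // Regex first: only matching rows are counted against the page.
        System.out.println(scan(keys, new KeyRegex("._1"), new PageLimit(2))); // [b_1, d_1]
    }
}
```

With the page limit first, the page of size 2 is exhausted after two rows, of which only one matches the regex. With the regex first, the page holds two matching rows.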

Also, remember that filters are applied separately in each region, so with a PageFilter the rows you get back may not be the ones you expect, and you may miss several rows.

So in your case, I don't think you need pagination. I would just use a Scan with your regex filter and no PageFilter. You then get a ResultScanner and call next() repeatedly until you have all the rows (i.e. stop when next() returns null).

Upvotes: 1
