Eric.guo

Reputation: 84

How can I query data from an HBase table in milliseconds?

I'm writing an interface to query paginated data from an HBase table by several conditions, but it's very slow. My rowkey looks like this: 12345678:yyyy-mm-dd, i.e. 8 random digits plus a date. I tried caching all the rowkeys in Redis and paginating there, but then it's difficult to query by the other conditions. I also considered designing a secondary index in HBase and discussed it with colleagues, but they think a secondary index is hard to maintain. So, can anyone give me some ideas?

Upvotes: 1

Views: 1761

Answers (3)

Ram Ghadiyaram

Reputation: 29195

First thing: AFAIK, a random number + date rowkey pattern may lead to hotspotting as you scale to large data volumes.

Regarding pagination:

I'd suggest Solr + HBase (if you are using Cloudera, that's Cloudera Search). It gives good performance (proven in our case) when querying 100 rows per page, and we populated an AngularJS dashboard from it via a web service call.

Also, and most importantly, you can move back and forth between pages without any issues.

[Diagram: Solr + HBase search/pagination architecture]

To achieve this, you need to create Solr collections from the HBase data; you can then query them with the SolrJ API, as sketched below.
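
A minimal sketch of that flow, assuming a collection whose documents carry hypothetical fields rowkey_s and date_s indexed from the HBase table; the table name and field names are assumptions, and constructing the SolrClient differs by SolrJ version (HttpSolrClient.Builder, or CloudSolrClient for SolrCloud / Cloudera Search):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SolrPageFetcher {

        // Fetch one page: ask Solr for the matching row keys, then multi-get the rows from HBase.
        static List<Result> fetchPage(SolrClient solr, Connection hbase,
                                      String date, int page, int pageSize) throws Exception {
            SolrQuery query = new SolrQuery("date_s:" + date);  // hypothetical indexed field
            query.setStart(page * pageSize);                    // Solr handles the paging offset
            query.setRows(pageSize);
            QueryResponse response = solr.query(query);

            List<Get> gets = new ArrayList<>();
            for (SolrDocument doc : response.getResults()) {
                String rowKey = (String) doc.getFieldValue("rowkey_s");  // hypothetical field
                gets.add(new Get(Bytes.toBytes(rowKey)));
            }
            try (Table table = hbase.getTable(TableName.valueOf("mytable"))) { // assumed table name
                return Arrays.asList(table.get(gets));          // batched point lookups
            }
        }
    }

Because the heavy filtering happens in Solr, HBase only serves a handful of point lookups per page, which is what keeps the latency in the millisecond range.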

HBase alone, with just the scan API, doesn't work well for quick ad-hoc queries.

Apart from that, please see my answer below, which goes into more implementation detail:

How to achieve pagination in HBase?

An HBase-only solution could be Hindex (a coprocessor-based secondary-index solution).

The link explains it in more detail.

[Diagram: Hindex architecture]

Upvotes: 3

John Leach

Reputation: 528

If you are looking for secondary indexes that are maintained in HBase, there are several open-source options (Splice Machine, Lilly, etc.). You can do index lookups in a few milliseconds.

Upvotes: 0

featuredpeow

Reputation: 2201

In HBase, to achieve good read performance you want your data retrieved by a small number of gets (requests for a single row) or by a small scan (a request over a range of rows). HBase stores your data sorted by key, so the most important idea is to come up with a row key that allows this. Your key seems to contain only a random integer and a date, so I assume your queries are about pagination over records marked with a time. A minimal sketch of both access patterns follows.
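
For reference, this is roughly what the two access patterns look like with the standard HBase 1.x client API (the table name and key values are just examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AccessPatterns {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("mytable"))) {

                // 1) Point lookup: a single Get by exact row key -- the fastest path.
                Result one = table.get(new Get(Bytes.toBytes("12345678:2015-08-16")));

                // 2) Small range scan: bounded by start/stop row, so only a few rows are read.
                Scan scan = new Scan(Bytes.toBytes("12345678:2015-08-16"),
                                     Bytes.toBytes("12345679:2015-08-16"));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }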

The first idea is that in a typical pagination scenario you access just one page at a time and navigate from page 1 to page 2 to page 3, etc. Say you want to paginate over all records for date 2015-08-16: you could use a scan of 50 rows with start key '\0:2015-08-16' (which sorts before any row in 2015-08-16) to retrieve the first page. After retrieving the first page you have the last key of that page, say '12345:2015-08-16'. You can use it (or '12346:2015-08-16') as the start key of another 50-row scan to retrieve page 2, and so on. With this approach each page is a single scan with a predefined number of returned rows, so it stays fast. You can pass the last row key of a page as a parameter to the paging API, or just put it in Redis so the next paging API call will find it there. A sketch of such a page fetch follows.
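
A minimal sketch of that page fetch, assuming the HBase 1.x client API; the page size and the convention that startRow is the first key of the page to return are assumptions:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.PageFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SequentialPager {
        static final int PAGE_SIZE = 50;

        // Returns one page of rows starting at startRow (e.g. "\0:2015-08-16" for page 1,
        // or the key right after the last row of the previous page for later pages).
        static List<Result> fetchPage(Table table, String startRow) throws IOException {
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes(startRow));
            scan.setFilter(new PageFilter(PAGE_SIZE)); // limit is applied per region server...
            scan.setCaching(PAGE_SIZE);                // ...so enforce the page size client-side too
            List<Result> page = new ArrayList<>();
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    page.add(r);
                    if (page.size() >= PAGE_SIZE) {
                        break;
                    }
                }
            }
            return page;
        }
    }

The getRow() of the last Result in the page is what you would hand back to the caller (or store in Redis) as the basis for the next page's start key.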

All this works perfectly well until some user comes in and clicks directly on page 100, or tries to jump to page 5 while on page 2. In such a scenario you can use a similar scan with nSkippedPages * 50 rows. This will not be as fast as sequential access, but it's not the usual usage pattern. You can then use Redis to cache the last row key of each page result in a structure like pageNumber -> rowKey, so if the next user comes and clicks on page 100, they will see the same performance as in the usual "click page 1, click page 2, click page 3" scenario. A sketch of such a cache follows.
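
A minimal sketch of that cache using the Jedis client; the hash layout (pageindex:<date>, field = page number, value = last row key) and the host are assumptions:

    import redis.clients.jedis.Jedis;

    public class PageIndexCache {

        // Remember the last row key of a page so later jumps can start a scan right after it.
        static void cachePageEnd(Jedis jedis, String date, int pageNumber, String lastRowKey) {
            jedis.hset("pageindex:" + date, String.valueOf(pageNumber), lastRowKey); // assumed layout
        }

        // Returns the cached last row key of the given page, or null if it was never visited.
        static String lookupPageEnd(Jedis jedis, String date, int pageNumber) {
            return jedis.hget("pageindex:" + date, String.valueOf(pageNumber));
        }

        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("redis-host", 6379)) {   // assumed host/port
                cachePageEnd(jedis, "2015-08-16", 2, "12345:2015-08-16");
                System.out.println(lookupPageEnd(jedis, "2015-08-16", 2));
            }
        }
    }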

Then, to make things faster for users who click on page 99 for the first time, you could write a separate daemon which retrieves every 50th row and puts the result in Redis as a page index. Launch it every 10-15 minutes and state that your page index has at most 10-15 minutes of stale data.

You can also design a separate API which preloads row keys for a bulk of N pages (say about 100 pages; it could be async, i.e. don't wait for the actual preload to complete). All it does is a scan with a KeyOnlyFilter and 50*N results, followed by selecting the start row key of each page. So it accepts a row key and populates the Redis cache with row keys for N pages. Then when a user lands on the first page you fetch the row keys of the first 100 pages for them, so when they click on some page link shown on the page, that page's start row key is already available. With the right bulk size for the preload you could approach your required latency. A sketch of the preload follows.
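
A minimal sketch of that preload, again assuming the HBase 1.x client and Jedis, with hypothetical parameter names; a FirstKeyOnlyFilter is combined with the KeyOnlyFilter so only one empty cell per row travels over the network, and the Redis layout (pagestart:<date>, field = page number, value = first row key of that page) is an assumption:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
    import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    import redis.clients.jedis.Jedis;

    public class PagePreloader {

        // Key-only scan from startRow that records the first row key of each of the next
        // nPages pages into a Redis page-start index.
        static void preload(Table table, Jedis jedis, String date, String startRow,
                            int firstPage, int nPages, int pageSize) throws IOException {
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes(startRow));
            scan.setFilter(new FilterList(new FirstKeyOnlyFilter(), new KeyOnlyFilter()));
            scan.setCaching(pageSize * 10);          // stream keys back in large chunks
            int seen = 0;
            int pagesRecorded = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    if (seen % pageSize == 0) {      // first key of a new page
                        jedis.hset("pagestart:" + date,
                                   String.valueOf(firstPage + pagesRecorded),
                                   Bytes.toString(r.getRow()));
                        pagesRecorded++;
                        if (pagesRecorded >= nPages) {
                            break;
                        }
                    }
                    seen++;
                }
            }
        }
    }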

The row limit can be implemented using a PageFilter (or Scan.setLimit() on newer HBase versions). The "skip nPages * 50 rows" and especially the "output every 50th row" functionality seems trickier: for the latter you may end up performing a full key-only scan, or writing a MapReduce job to do it; for the former it is not clear how to do it without sending rows over the network, since the request can be distributed across several regions.

Upvotes: 2
