RecSys_2010

Reputation: 375

HBase access and index

I have an HBase table with about 50 million rows, and each row has several columns. My goal is to retrieve the rows that have a given value in a given column, e.g. the rows whose column 'col_1' has the value 'val_1'.

I have two options to choose:

  1. scan the table from beginning to end and check each row to see whether it should be retrieved;
  2. build an index for this table (e.g., an index on the values of column 'col_1'), then for a given value 'val_1' fetch all the row keys associated with 'val_1' from the index, and retrieve the corresponding rows by those keys. As I understand it, this involves random access into the original HBase table (see the sketch after this list).
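
To make the two options concrete, here is a minimal sketch against the older HTable-based HBase Java client. The table name 'my_table', column family 'cf', and the index-table layout ('my_table_col1_idx', row key = column value, one qualifier per matching main-table row key in family 'd') are assumptions made up for illustration, not part of the question.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ColumnValueLookup {

        // Option 1: full table scan with a server-side filter on cf:col_1.
        // Every row is still read by the region servers; only matches come back.
        static void scanWithFilter(Configuration conf) throws Exception {
            HTable table = new HTable(conf, "my_table");
            SingleColumnValueFilter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("cf"), Bytes.toBytes("col_1"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("val_1"));
            filter.setFilterIfMissing(true);          // skip rows without col_1
            Scan scan = new Scan();
            scan.setFilter(filter);
            scan.setCaching(1000);                    // batch rows per RPC
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            } finally {
                scanner.close();
                table.close();
            }
        }

        // Option 2: one point read on the index table, then batched random Gets
        // against the main table for only the matching row keys.
        static void lookupViaIndex(Configuration conf) throws Exception {
            HTable mainTable = new HTable(conf, "my_table");
            HTable indexTable = new HTable(conf, "my_table_col1_idx");
            Result indexRow = indexTable.get(new Get(Bytes.toBytes("val_1")));
            List<Get> gets = new ArrayList<Get>();
            if (!indexRow.isEmpty()) {
                // each qualifier in family 'd' holds one main-table row key
                for (byte[] rowKey : indexRow.getFamilyMap(Bytes.toBytes("d")).keySet()) {
                    gets.add(new Get(rowKey));
                }
            }
            for (Result row : mainTable.get(gets)) {
                System.out.println(Bytes.toString(row.getRow()));
            }
            indexTable.close();
            mainTable.close();
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            scanWithFilter(conf);
            lookupViaIndex(conf);
        }
    }

Which one is faster depends mostly on selectivity: if 'val_1' matches only a small fraction of the 50 million rows, the index lookup plus a handful of random Gets should win easily; if it matches a large fraction, the many random reads can end up slower than one sequential filtered scan.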

Can anyone suggest which option runs faster, or is there a better option?

Thanks a lot!

Upvotes: 3

Views: 14631

Answers (3)

mibrahim

Reputation: 11

A secondary index will be faster. You can also try a secondary-index library such as Culvert instead of building your own index.

Upvotes: 1

Arnon Rotem-Gal-Oz

Reputation: 25939

An index will surely work faster than scanning 50M rows every time. If you use an HBase version that already has coprocessors, you can follow Xodarap's advice. If you are using an older version of HBase, you need to set up an additional table to act as the index and update it manually (either every time you update the main table, or occasionally via MapReduce).
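
A minimal sketch of the manual dual write this answer describes, assuming the same illustrative index layout as in the question above (index row key = column value, one qualifier per main-table row key); the table and family names are hypothetical. Note the two puts are not atomic across tables, so a failure between them can leave the index briefly out of sync.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DualWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable mainTable = new HTable(conf, "my_table");
            HTable indexTable = new HTable(conf, "my_table_col1_idx");

            byte[] rowKey = Bytes.toBytes("row_42");   // hypothetical row key
            byte[] value = Bytes.toBytes("val_1");

            // 1) write the data row to the main table
            Put mainPut = new Put(rowKey);
            mainPut.add(Bytes.toBytes("cf"), Bytes.toBytes("col_1"), value);
            mainTable.put(mainPut);

            // 2) mirror it into the index table: row key = column value,
            //    qualifier = main-table row key, empty cell value
            Put indexPut = new Put(value);
            indexPut.add(Bytes.toBytes("d"), rowKey, HConstants.EMPTY_BYTE_ARRAY);
            indexTable.put(indexPut);

            indexTable.close();
            mainTable.close();
        }
    }

If the column value of an existing row ever changes, the old index entry also has to be deleted; a periodic MapReduce rebuild, as the answer suggests, is a simpler way to keep the index consistent.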

Upvotes: 2

Xodarap

Reputation: 11859

Are you asking whether adding an index will make it faster? The answer is, of course, yes. See the HBase wiki for thoughts on secondary indexes.

Upvotes: 4
