Reputation: 2979
Well, the title of the question says it all. I have a requirement to get the row keys corresponding to the top X (say, top 10) values in a certain column, so I need to sort HBase rows by that column's values. I don't understand how I should do this, or even whether it is doable. It seems that HBase does not cater to this very well, and it does not offer any such functionality out of the box.
Q1. Can I use the hbase-spark connector, load the whole HBase data into a Spark RDD and then sort it there? Will this be fast? How will the connector and Spark handle it? Will it fetch the whole data onto a single node, or onto multiple nodes and sort in a distributed manner?
Q2. Also, is there any better way to do this?
Q3. Is it not doable in HBase at all, and should I opt for a different framework/technology altogether?
Upvotes: 2
Views: 345
Reputation: 3990
A3. If you need to sort your data by some column (not the row key), you get no benefit from using HBase. It will be the same as reading raw files from Hive/HDFS and sorting them, but slower.
A1. Sure, you can use SHC or any other Spark-HBase library for that matter, but A3 still holds. It will load the entire data on every region server as a Spark RDD, only to shuffle it across your entire cluster.
A2. As with any other programming/architecture issue, there are many possible solutions depending on your resources and requirements.
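For the top-X case specifically, a full sort of the whole table is not strictly required. A minimal sketch in plain Python (no Spark or HBase involved; the partition contents and the names top_k_per_partition and top_k are made up for illustration): each partition keeps only its own k largest values, and only those few candidates are merged. As far as I know, PySpark's RDD.top(num, key=...) follows the same general pattern, so very little data needs to move between nodes.

```python
import heapq

def top_k_per_partition(rows, k):
    """Return the k (row_key, value) pairs with the largest value in one partition."""
    return heapq.nlargest(k, rows, key=lambda kv: kv[1])

def top_k(partitions, k):
    """Two-stage top-k: local top-k per partition, then a merge of the candidates."""
    # Stage 1: every partition is reduced independently to at most k rows.
    local_results = [top_k_per_partition(part, k) for part in partitions]
    # Stage 2: only the small candidate lists are combined and reduced again.
    candidates = [kv for part in local_results for kv in part]
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])

# Hypothetical data: three partitions of (row_key, column_value) pairs,
# standing in for rows read from three different regions.
partitions = [
    [("row1", 5), ("row2", 42), ("row3", 7)],
    [("row4", 19), ("row5", 3)],
    [("row6", 88), ("row7", 41)],
]

result = top_k(partitions, 3)
print(result)
```

This still reads every row once, so the point of A3 stands; it only avoids shuffling and sorting the full data set.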
Will Spark load all the data on a single node and sort it there, or will it perform the sort on different nodes?
It depends on two factors:
spark.sql.shuffle.partitions configuration value: after loading the data from the table, this value determines the degree of parallelism for the sorting stage.
Is there any better [library] than SHC?
As of today there are multiple libraries for integrating Spark with HBase. Each has its own pros and cons, and IMO none of them is fully mature or gives full coverage (compared to the Spark-Hive integration, for example). To get the best from Spark over HBase you should have a very good understanding of your use case and select the most suitable library.
Upvotes: 2
Reputation: 2472
Q2. Also is there any better way to do this?
If redesigning your HBase table is an option, with this specific column value as part of the rowkey, this would allow fast access to these values, as HBase is optimised for rowkey filters and not column filters.
You could then create a rowkey as a concatenation of existing_rowkey + this_col_value. Querying it with a Row Filter would then give better performance.
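A rough sketch of how such a composite rowkey could be encoded, in plain Python with no HBase client involved. It rests on several assumptions that are not in the answer above: the column value is a non-negative integer that fits in 64 bits, the value is placed before the existing key (the reverse of the concatenation order written above), and it is stored inverted so that the largest values come first. HBase keeps rows sorted by rowkey bytes, so with this hypothetical layout a plain scan limited to X rows would return the top X. The helper names make_rowkey and split_rowkey are invented for illustration.

```python
import struct

MAX_UINT64 = 2**64 - 1

def make_rowkey(value, existing_rowkey):
    """Hypothetical layout: 8-byte big-endian inverted value, then the original key."""
    # Inverting the value makes larger values sort earlier in byte order.
    inverted = MAX_UINT64 - value
    # Fixed-width big-endian encoding keeps byte order equal to numeric order.
    return struct.pack(">Q", inverted) + existing_rowkey.encode("utf-8")

def split_rowkey(rowkey):
    """Recover (value, existing_rowkey) from a key built by make_rowkey."""
    (inverted,) = struct.unpack(">Q", rowkey[:8])
    return MAX_UINT64 - inverted, rowkey[8:].decode("utf-8")

# Made-up rows: existing rowkey -> value of the column of interest.
rows = {"user-a": 17, "user-b": 2500, "user-c": 300, "user-d": 9}

# sorted() on bytes mimics the lexicographic order in which rows would be stored.
stored_order = sorted(make_rowkey(v, k) for k, v in rows.items())

# The first X keys in storage order are the top X by value.
top2 = [split_rowkey(k) for k in stored_order[:2]]
print(top2)
```

One trade-off of this kind of design: whenever the column value changes, the row has to be rewritten under a new rowkey.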
Upvotes: 0