Reputation: 281
I am trying to understand what would be the best big data solution for reporting purposes?
Currently I narrowed it down to HBase vs Hive.
The use case is that we have hundreds of terabytes of data with hundreds different files. The data is live and gets updated all the time. We need to provide the most efficient way to do reporting. We have dozens different reports pages where each report consist of different type of numeric and graph data. For instance:
The way I see it, there are 3 solutions:
Store all data in Hadoop and do the queries in Hive. This might work but I am not sure about the performance. How will it perform when the data is 100 TB? Also, Having Hadoop as the main data base is probably not the best solution as update operation will be hard to achieve, right?
Store all data in HBase and do the queries using Phoenix. This solution is nice but HBase is a key/value store. If I join on a key that is not indexed then HBase will do a full scan which will probably be even worse than Hive. I can put index on columns but that will require to put an index on almost each column which is I think not the best recommendation.
Store all data in HBase and do the queries in Hive that communicates with HBase using it propriety bridge.
Upvotes: 0
Views: 2157
Reputation: 1647
Respective responses on your suggested solutions (based on my personal experience with similar problem):
1) You should not think of Hive as a regular RDMS as it is best suited for Immutable data. So it is like killing your box if you want to do updates using Hive.
2) As suggested by Paul, in comments you can use Phoenix to create indexes but we tried it and it will be really slow with the volume of data that you suggested (we saw slowness in Hbase with ~100 GB of data.)
3) Hive with Hbase is slower than Phoenix (we tried it and Phoenix worked faster for us)
If you are going to do updates, then Hbase is the best option that you have and you can use Phoenix with it. However if you can make the updates using Hbase, dump the data into Parquet and then query it using Hive it will be super fast.
Upvotes: 1
Reputation: 75
The first solution (Store all data in Hadoop and do the queries in Hive), won't allow you to update data. You can just insert to the hive table. Plain hive is pretty slow, as for me it's better to use Hive LLAP or Impala. I've used Impala, it's show pretty good performance, but it's can efficiently, only one query per time. Certainly, update rows in Impala isn`t possible too.
The third solution will get really slow join performance. I've tried Impala with HBase, and join works extremely slow.
About processing data size and cluster size ratio for Impala, https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_cluster_sizing.html
If you need rows update, you can try Apache Kudu. Here you can find integration guide for Kudu with Impala: https://www.cloudera.com/documentation/enterprise/5-11-x/topics/impala_kudu.html
Upvotes: 0
Reputation: 11
You can use a lambda structure which is , hbase along with some stream-compute tools such as spark streaming. You store data in hbase ,and when there is new data coming ,update both original data and report by stream-compute. When a new report is created ,you can generate it from a full-scan of hbase, after that ,the report can by updated by stream-compute. You can also use a map-reduce job to adjust the stream-compute result periodically.
Upvotes: 0