Does HBase and Hive need to be co-located on the same machine?

Question

We currently build a new hbase cluster. The architecture is as follow:

now I want to add a new hive cluster. But I don't know if I should build hive on the same hadoop cluster with hbase. Or should I build a new hadoop cluster for hive?

If I build a new hadoop cluster for hive, will the traffic between hive and hbase be too huge when I execute SQL to aggregate data of hbase (AFAIK, hive need to import the data on hbase to its own hdfs storage)?

Alex · Accepted Answer

Lets first define what is the traffic between HBase and Hive. In every Hive query you will either:

use Hive for querying HBase directly
use Hive to perform Join requests of Warehouse tables with HBase table
use Hive to perform Join requests of external tables with HBase table

If you take a deep dive into the architecture of all these components you will notice that Hive Warehouse and Hbase both use HDFS behind the scene. Same can be considered to external tables which are located in HDFS. So if you deploy Hive into current architecture, there internal data would physically be stored at the same place as HBase - in the DataNodes of your data Rack. It means that you traffic would be optimal for the tasks which will be scheduled by YARN through your Hive queries. If you create another Rack for hive Warehouse with a set of separate DataNodes located at another Data Center, this would negatively impact your performance.

So if you have enough capacities to keep data within single Rack for all your services, go this way. You can read more about Hive and HBase infrastructure in referenced links.

Does HBase and Hive need to be co-located on the same machine?

Answers (1)

Related Questions