codeBarer
codeBarer

Reputation: 2388

Is it bad to do joins in Hive?

Hi I recently joined a new job that uses Hive and PostgreSQL. The existing ETL scripts gather data from Hive partitioned by dates and creates tables for those data in PostgreSQL and then the PostgreSQL scripts/queries perform left joins and create the final table for reporting purpose. I have heard in the past that Hive joins are not a good idea. However, I noticed that Hive does allow joins so I'm not sure why it's a bad idea.

I wanted to use something like Talend or Mulesoft to create joins and do aggregations within hive and create a temporary table and transfer that temporary table as the final table to PostgreSQL for reporting.

Any suggestions, especially if this is not a good practice with HIVE. I'm new to hive.

Thanks.

Upvotes: 0

Views: 1808

Answers (3)

Dennis Jaheruddin
Dennis Jaheruddin

Reputation: 21563

Hive is growing in maturity

It is possible that arguments against using joins, no longer apply for recent versions of hive.

The most clear example I found in the manual section on join optimization:

The MAPJOIN implementation prior to Hive 0.11 has these limitations:

The mapjoin operator can only handle one key at a time

Therefore I would recommend asking what the foundation of their reluctance is, and then checking carefully whether it still applies. Their arguments may well still be valid, or might have been resolved.


Sidenote: Personally I find Pig code much easier to re-use and maintain than hive, consider using Pig rather than hive to do map-reduce operations on your (hive table) data.

Upvotes: 1

Shubham Sahay
Shubham Sahay

Reputation: 88

Its perfectly fine to do joins in HIVE, I am a ETL tester and have performed left joins on big tables in Hive most of the time the queries run smoothly but some times the job do get stuck or are slow due to network traffic.

Also depends on number of Nodes the cluster is having.

Thanks

Upvotes: -1

nsousa
nsousa

Reputation: 4544

The major issue with joining in hive has to do with data locality.

Hive queries are executed as MapReduce jobs and several mappers will launch, as much as possible, in the nodes where the data lies.

However, when joining tables the two rows of data from LHS and RHS tables will not in general be in the same node, which may cause a significant amount of network traffic between nodes.

Joining in Hive is not bad per se, but if the two tables being joined are large may result in slow jobs.

If one of the tables is significantly smaller than the other you may want to store it in HDFS cache, making its data available in every node, which allows the join algorithm to retrieve all data locally.

So, there's nothing wrong with running large joins in Hive, you just need to be aware they need their time to finish.

Upvotes: 2

Related Questions