Reputation: 2573
I have 3 small tables: smallTable0
, smallTable1
, and smallTable3
. They all have fewer than 100 rows and identical schemas. I also have 3 large tables: largeTable0
, largeTable1
, and largeTable3
. The all have more than 1M rows, have identical schemas, share the id
column with the small tables, are are partitioned on something other than id
(in case partitioning matters, I suspect it does not).
After setting hive.auto.convert.join=true
, the following cases result in a MapJoin, as expected:
smallTable0
against smallTable1
smallTable0
against largeTable0
smallTable0
against the smallTable1 UNION ALL smallTable2
The following cases do not result in a MapJoin, as expected:
largeTable0
against anything.smallTable0
against anything with hive.auto.convert.join=false
Unexpectedly however, the following case also does not result in a MapJoin:
smallTable0
against largeTable0 UNION ALL largeTable1
The exact query is as follows:
SELECT * FROM smallTable0 s
JOIN (
SELECT * FROM (
SELECT * FROM largeTable0
UNION ALL
SELECT * FROM largeTable1
) x
) l
ON s.id = l.id;
It runs fine, but with a Common Join instead of a MapJoin and it is causing a performance hit. Creating a view which represents largeTable0 UNION ALL largeTable1
does not resolve the issue. I'm sure creating a table which is largetTable0 UNION ALL largeTable1
would resolve the issue, but duplicating so much data and then keeping them in sync is undesirable.
The source code for the Union operator (here) has a comment which I find somewhat cryptic.
/**
* Union operators are not allowed either before or after a explicit mapjoin hint.
* Note that, the same query would just work without the mapjoin hint (by setting
* hive.auto.convert.join to true).
**/
@Override
public boolean opAllowedBeforeMapJoin() {
return false;
}
@Override
public boolean opAllowedAfterMapJoin() {
return false;
}
It seems to suggest that the UNION operator is not allowed with an explicit MapJoin hint, but that the UNION operator is allowed with MapJoins initiated as a result of hive.auto.convert.join
. However I do not understand why one would be allowed the other disallowed. Unless "just work" means that the query will "work", just not with a MapJoin. If this were the case, however, I would have expected joining smallTable0
to smallTable1 UNION ALL smallTable2
to result in a Common Join.
Is the odd behavior a result of a bug in Hive, a bug in my code, a missing feature in Hive, or a misunderstanding on my part?
Upvotes: 0
Views: 1429
Reputation: 6443
You can specify hints to Hive for handling the tables during a join. I always specify MAPJOIN or STREAMTABLE if I know if a small table is candidate for join, or the very large table that should be streamed to the others.
e.g.
SELECT /*+ MAPJOIN(smalltable0) */ * FROM smallTable0 s
Upvotes: 1