Hive does not MapJoin a small table against the Union of two large tables

Question

I have 3 small tables: smallTable0, smallTable1, and smallTable3. They all have fewer than 100 rows and identical schemas. I also have 3 large tables: largeTable0, largeTable1, and largeTable3. The all have more than 1M rows, have identical schemas, share the id column with the small tables, are are partitioned on something other than id (in case partitioning matters, I suspect it does not).

After setting hive.auto.convert.join=true, the following cases result in a MapJoin, as expected:

Joining smallTable0 against smallTable1
Joining smallTable0 against largeTable0
Joining smallTable0 against the smallTable1 UNION ALL smallTable2

The following cases do not result in a MapJoin, as expected:

Joining largeTable0 against anything.
Joining smallTable0 against anything with hive.auto.convert.join=false

Unexpectedly however, the following case also does not result in a MapJoin:

Joining smallTable0 against largeTable0 UNION ALL largeTable1

The exact query is as follows:

SELECT * FROM smallTable0 s
JOIN (
  SELECT * FROM (
    SELECT * FROM largeTable0
    UNION ALL
    SELECT * FROM largeTable1 
  ) x
) l
  ON s.id = l.id;

It runs fine, but with a Common Join instead of a MapJoin and it is causing a performance hit. Creating a view which represents largeTable0 UNION ALL largeTable1 does not resolve the issue. I'm sure creating a table which is largetTable0 UNION ALL largeTable1 would resolve the issue, but duplicating so much data and then keeping them in sync is undesirable.

The source code for the Union operator (here) has a comment which I find somewhat cryptic.

/**
* Union operators are not allowed either before or after a explicit mapjoin hint.
* Note that, the same query would just work without the mapjoin hint (by setting
* hive.auto.convert.join to true).
**/
@Override
public boolean opAllowedBeforeMapJoin() {
  return false;
}

@Override
public boolean opAllowedAfterMapJoin() {
  return false;
}

It seems to suggest that the UNION operator is not allowed with an explicit MapJoin hint, but that the UNION operator is allowed with MapJoins initiated as a result of hive.auto.convert.join. However I do not understand why one would be allowed the other disallowed. Unless "just work" means that the query will "work", just not with a MapJoin. If this were the case, however, I would have expected joining smallTable0 to smallTable1 UNION ALL smallTable2 to result in a Common Join.

Is the odd behavior a result of a bug in Hive, a bug in my code, a missing feature in Hive, or a misunderstanding on my part?

Hive does not MapJoin a small table against the Union of two large tables

Answers (1)

Related Questions