Daniel Koverman
Daniel Koverman

Reputation: 2573

Hive does not MapJoin a small table against the Union of two large tables

I have 3 small tables: smallTable0, smallTable1, and smallTable3. They all have fewer than 100 rows and identical schemas. I also have 3 large tables: largeTable0, largeTable1, and largeTable3. The all have more than 1M rows, have identical schemas, share the id column with the small tables, are are partitioned on something other than id (in case partitioning matters, I suspect it does not).

After setting hive.auto.convert.join=true, the following cases result in a MapJoin, as expected:

The following cases do not result in a MapJoin, as expected:

Unexpectedly however, the following case also does not result in a MapJoin:

The exact query is as follows:

SELECT * FROM smallTable0 s
JOIN (
  SELECT * FROM (
    SELECT * FROM largeTable0
    UNION ALL
    SELECT * FROM largeTable1 
  ) x
) l
  ON s.id = l.id;

It runs fine, but with a Common Join instead of a MapJoin and it is causing a performance hit. Creating a view which represents largeTable0 UNION ALL largeTable1 does not resolve the issue. I'm sure creating a table which is largetTable0 UNION ALL largeTable1 would resolve the issue, but duplicating so much data and then keeping them in sync is undesirable.

The source code for the Union operator (here) has a comment which I find somewhat cryptic.

/**
* Union operators are not allowed either before or after a explicit mapjoin hint.
* Note that, the same query would just work without the mapjoin hint (by setting
* hive.auto.convert.join to true).
**/
@Override
public boolean opAllowedBeforeMapJoin() {
  return false;
}

@Override
public boolean opAllowedAfterMapJoin() {
  return false;
}

It seems to suggest that the UNION operator is not allowed with an explicit MapJoin hint, but that the UNION operator is allowed with MapJoins initiated as a result of hive.auto.convert.join. However I do not understand why one would be allowed the other disallowed. Unless "just work" means that the query will "work", just not with a MapJoin. If this were the case, however, I would have expected joining smallTable0 to smallTable1 UNION ALL smallTable2 to result in a Common Join.

Is the odd behavior a result of a bug in Hive, a bug in my code, a missing feature in Hive, or a misunderstanding on my part?

Upvotes: 0

Views: 1429

Answers (1)

libjack
libjack

Reputation: 6443

You can specify hints to Hive for handling the tables during a join. I always specify MAPJOIN or STREAMTABLE if I know if a small table is candidate for join, or the very large table that should be streamed to the others.

e.g.

SELECT /*+ MAPJOIN(smalltable0) */ * FROM smallTable0 s

Upvotes: 1

Related Questions