Reputation: 1712

Is Hive QL IN bad for performance?

I have a Hive SQL like

select 5 in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);

When I explain it it tell me that hive will scan this one:

== Physical Plan ==
*(1) Project [true AS (5 IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20))
+- Scan OneRowRelation[]

Does that mean hive will perform a sequence scan instead of a set look up on this set (which cause a bad performance)?

Upvotes: 0

Answers (1)

Reputation: 553

The answer can be found in the implementation of org.apache.hadoop.hive.ql.udf.generic.GenericUDFIn.

Simply put, it depends on the args in the parentheses. Specifically whether they are all constants.

In the case of all constants, a HashSet will be initialized once for each mapper respectively. And when evaluation is done at one row, just check if the value is in the set.
And if at least one of the arg is not a constant, that is, args depend on the row being evaluated, then a for loop that goes through all the args is executed for each row. So in this case, long argument list slows down the execution.

See the code here for more details: GenericUDFIn.

Have a nice day :)

Upvotes: 1