Reputation: 965
I have a small table adm with one column x that contains only 10 rows. Now I want to filter another table big, which is partitioned by y, with the values from adm using partition pruning.
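In case it helps, a minimal setup to reproduce could look like this (the contents and the extra payload column are just placeholders, not my real data):

scala> spark.range(10).selectExpr("cast(id as string) as x").write.saveAsTable("adm")
scala> spark.range(1000).selectExpr("cast(id % 10 as string) as y", "id as payload").write.partitionBy("y").saveAsTable("big")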
While the partition filter pushdown works here:

select * from big b
where b.y = (select max(a.x) from adm a)

this query unfortunately results in a broadcast join between a and b:

select * from big b
where b.y IN (select a.x from adm a)

How can the subquery be pushed down as a partition filter even when I use IN?
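The difference is visible when comparing the physical plans with explain (assuming big is a file-based partitioned table): in the first case the partition predicate ends up in the scan of big, in the second the plan shows the broadcast join instead.

scala> spark.sql("select * from big b where b.y = (select max(a.x) from adm a)").explain(true)
scala> spark.sql("select * from big b where b.y IN (select a.x from adm a)").explain(true)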
Upvotes: 2
Views: 1472
Reputation: 9427
This is happening because the result of your subquery is itself a distributed dataset (an RDD), so Spark evaluates it in a truly distributed fashion -- via a broadcast join -- just as it would for a filter on any other column, not specifically the partition column.
To work around this, you will need to execute the subquery separately, collect the result, and format it into a literal list usable in the IN clause.
scala> import org.apache.spark.sql.Encoders
scala> import scala.collection.JavaConverters._  // needed for .asScala on the collected Java list
scala> val ax = spark.sql("select a.x from adm a")
scala> val inclause = ax.as(Encoders.STRING).map(x => "'" + x + "'").collectAsList().asScala.mkString(",")
scala> spark.sql("select * from big b where b.y IN (" + inclause + ")")
(This assumes x and y are strings.)
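A similar sketch using the DataFrame API, which avoids building the SQL string by hand (again assuming x and y are strings and both tables are registered in the catalog). Because isin receives plain literals, the same partition pruning should apply:

scala> import org.apache.spark.sql.Encoders
scala> import org.apache.spark.sql.functions.col
scala> val values = spark.table("adm").select("x").as(Encoders.STRING).collect()
scala> spark.table("big").where(col("y").isin(values: _*))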
Upvotes: 1