Joha

Reputation: 965

Spark filter pushdown with multiple values in subquery

I have a small table adm with a single column x and only 10 rows. I want to filter another table big, which is partitioned by y, using the values from adm, so that partition pruning kicks in.

While for this query

select * from big b 
where b.y = ( select max(a.x) from adm a)

the partition filter pushdown works, but unfortunately this:

select * from big b
where b.y IN (select a.x from adm a )

results in a broadcast join between adm and big.

How can the subquery be pushed down as a partition filter even when I use IN?

Upvotes: 2

Views: 1472

Answers (1)

mazaneicha

Reputation: 9427

This is happening because the result of your subquery is itself a distributed dataset, so Spark treats it like any other column -- not necessarily a partition column -- and plans a join (broadcast, in this case) rather than a partition filter.
To work around this, you need to execute the subquery separately, collect the result on the driver, and format it into a literal usable in an IN clause.

scala> import org.apache.spark.sql.Encoders
scala> import scala.collection.JavaConverters._
scala> val ax = spark.sql("select a.x from adm a")
scala> val inclause = ax.as(Encoders.STRING).map(x => "'" + x + "'").collectAsList().asScala.mkString(",")
scala> spark.sql("select * from big b where b.y IN (" + inclause + ")")

(This assumes x and y are strings.)
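As a side note, if any of the collected values can contain a single quote, naive quoting will produce invalid SQL. A minimal sketch of the string-building step with standard SQL escaping (doubling embedded single quotes) -- `toInClause` is a hypothetical helper, not part of any Spark API:

```scala
// Hedged sketch: quote and escape already-collected string values so they can
// be inlined safely into an IN (...) clause. Assumes x and y are strings.
object InClauseBuilder {
  def toInClause(values: Seq[String]): String =
    values
      .map(v => "'" + v.replace("'", "''") + "'") // double embedded quotes per the SQL standard
      .mkString(",")
}

// Example: the second value contains a single quote.
val clause = InClauseBuilder.toInClause(Seq("p1", "o'brien"))
// clause == "'p1','o''brien'"
```

You would then splice `clause` into the outer query exactly as in the snippet above.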

Upvotes: 1
