Reputation: 10153
Right now I have the following code:
df1 = df.filter((df.col1.isin(List1)) | (df.col2.isin(List2)))
I got List1 by calling collect() on a dataframe, so instead I would like to use a join. I tried the following:
df1 = df.filter(df.col2.isin(List2))
df2 = df.join(df_List1, 'col1', 'leftsemi')
df3 = df1.join(df2, 'col1', 'outer')
I have two questions:
Upvotes: 0
Views: 1974
Reputation: 35229
Is it worthwhile performance-wise?
As always, when asking performance-related questions you should test both on real data (or data that truthfully reflects the real distribution) and using the actual resources at your disposal.
That being said, if List1 and List2 are small enough for df.filter((df.col1.isin(List1)) | (df.col2.isin(List2))) to succeed, it is hard to expect any improvement from using joins, as OR-based JOINs cannot be easily optimized.
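Since the question is about semantics rather than a specific cluster, here is a plain-Python sketch of what the isin-based filter computes; the row data and list contents are made up for illustration.

```python
# Hypothetical sample rows standing in for the DataFrame `df`.
rows = [
    {"col1": "a", "col2": "x"},
    {"col1": "b", "col2": "y"},
    {"col1": "c", "col2": "z"},
]
list1 = ["a"]  # values previously collect()-ed for col1
list2 = ["z"]  # values for col2

# Equivalent of df.filter((df.col1.isin(List1)) | (df.col2.isin(List2))):
# keep a row if either column matches its list.
filtered = [r for r in rows if r["col1"] in list1 or r["col2"] in list2]
```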
These can be expressed as:
SELECT DISTINCT col1, col2 FROM (
    SELECT t.* FROM t JOIN r1 ON t.col1 = r1.col1
    UNION
    SELECT t.* FROM t JOIN r2 ON t.col2 = r2.col2
)
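To make the UNION rewrite concrete, here is a plain-Python model of it (the tables t, r1, r2 and their contents are hypothetical); note that UNION without ALL deduplicates, which is what the final set union captures.

```python
# Rows of t as (col1, col2) pairs, plus key sets for the reference tables.
t = [("a", "x"), ("b", "y"), ("c", "z")]
r1 = {"a"}        # keys of r1.col1
r2 = {"x", "z"}   # keys of r2.col2

part1 = [row for row in t if row[0] in r1]  # t JOIN r1 ON t.col1 = r1.col1
part2 = [row for row in t if row[1] in r2]  # t JOIN r2 ON t.col2 = r2.col2
unioned = set(part1) | set(part2)           # UNION removes duplicates
```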
or
WITH
t1 AS (SELECT t.* FROM t JOIN r1 ON t.col1 = r1.col1),
t2 AS (SELECT t.* FROM t JOIN r2 ON t.col2 = r2.col2)
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.col1 = t2.col1 -- if col1 is a unique identifier
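The FULL OUTER JOIN variant can be modeled the same way (hypothetical data again); it relies on the assumption in the SQL comment that col1 is a unique identifier, which lets us represent t as a mapping from col1 to col2.

```python
# t keyed by col1 (assumed unique), value is col2.
t = {"a": "x", "b": "y", "c": "z"}
r1 = {"a"}   # keys of r1.col1
r2 = {"z"}   # keys of r2.col2

t1 = {k: v for k, v in t.items() if k in r1}  # matches on col1
t2 = {k: v for k, v in t.items() if v in r2}  # matches on col2
# FULL OUTER JOIN on col1 keeps every key present on either side.
result = {k: t[k] for k in set(t1) | set(t2)}
```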
In general (we can safely omit the case where t is small enough to be stored in the memory of a single machine) both will require a full shuffle of t, making the size of t a limiting factor.
Also, logical disjunction with local objects can short-circuit when the first component is true. It is not possible to do the same in the general join case.
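A small demonstration of that short-circuiting in Python: when the first operand of `or` is truthy, the second is never evaluated (the expensive_check function here is a made-up stand-in for a costly membership test).

```python
calls = []  # record which arguments actually reached the expensive branch

def expensive_check(x):
    calls.append(x)
    return True

small = {"a"}
hit = "a" in small or expensive_check("a")   # first operand True -> skipped
miss = "q" in small or expensive_check("q")  # first operand False -> runs
```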
In the future, Spark should support a single-column DataFrame in isin (SPARK-23945), and the optimizer should be able to make the decision between broadcast and hash join for you.
Upvotes: 2