Reputation: 10153
Right now I have the following code:
df1 = df.filter((df.col1.isin(List1)) | (df.col2.isin(List2)))
I got List1 by calling collect() on a dataframe, so instead I would like to use a join. I tried the following:
df1 = df.filter(df.col2.isin(List2))
df2 = df.join(df_List1, 'col1', 'leftsemi')
df3 = df1.join(df2, 'col1', 'outer')
I have two questions:
Upvotes: 0
Views: 1974
Reputation: 35229
Is it worthwhile performance-wise?
As always, when asking performance-related questions you should test both on real data (or data that truthfully reflects the real distribution) and using the actual resources at your disposal.
That being said, if List1 and List2 are small enough for df.filter((df.col1.isin(List1)) | (df.col2.isin(List2))) to succeed, it is hard to expect any improvement from using joins, as OR-based JOINs cannot be easily optimized.
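Since the question is about semantics rather than a specific cluster, here is a plain-Python sketch of what the isin-based filter computes; the row data and list contents are made up for illustration.

```python
# Hypothetical sample rows standing in for the DataFrame `df`.
rows = [
    {"col1": "a", "col2": "x"},
    {"col1": "b", "col2": "y"},
    {"col1": "c", "col2": "z"},
]
list1 = ["a"]  # values previously collect()-ed for col1
list2 = ["z"]  # values for col2

# Equivalent of df.filter((df.col1.isin(List1)) | (df.col2.isin(List2))):
# keep a row if either column matches its list.
filtered = [r for r in rows if r["col1"] in list1 or r["col2"] in list2]
```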
These can be expressed as:
SELECT DISTINCT col1, col2 FROM (
    SELECT t.* FROM t JOIN r1 ON t.col1 = r1.col1
    UNION
    SELECT t.* FROM t JOIN r2 ON t.col2 = r2.col2
)
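To make the UNION rewrite concrete, here is a plain-Python model of it (the tables t, r1, r2 and their contents are hypothetical); note that UNION without ALL deduplicates, which is what the final set union captures.

```python
# Rows of t as (col1, col2) pairs, plus key sets for the reference tables.
t = [("a", "x"), ("b", "y"), ("c", "z")]
r1 = {"a"}        # keys of r1.col1
r2 = {"x", "z"}   # keys of r2.col2

part1 = [row for row in t if row[0] in r1]  # t JOIN r1 ON t.col1 = r1.col1
part2 = [row for row in t if row[1] in r2]  # t JOIN r2 ON t.col2 = r2.col2
unioned = set(part1) | set(part2)           # UNION removes duplicates
```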
or
WITH
t1 AS (SELECT t.* FROM t JOIN r1 ON t.col1 = r1.col1),
t2 AS (SELECT t.* FROM t JOIN r2 ON t.col2 = r2.col2)
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.col1 = t2.col1 -- if col1 is a unique identifier
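The FULL OUTER JOIN variant can be modeled the same way (hypothetical data again); it relies on the assumption in the SQL comment that col1 is a unique identifier, which lets us represent t as a mapping from col1 to col2.

```python
# t keyed by col1 (assumed unique), value is col2.
t = {"a": "x", "b": "y", "c": "z"}
r1 = {"a"}   # keys of r1.col1
r2 = {"z"}   # keys of r2.col2

t1 = {k: v for k, v in t.items() if k in r1}  # matches on col1
t2 = {k: v for k, v in t.items() if v in r2}  # matches on col2
# FULL OUTER JOIN on col1 keeps every key present on either side.
result = {k: t[k] for k in set(t1) | set(t2)}
```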
In general (we can safely omit the case where t is small enough to be stored in the memory of a single machine) both will require a full shuffle of t, making the size of t a limiting factor.
Also, logical disjunction with local objects can short-circuit when the first component is true. It is not possible to do the same in the general join case.
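A small demonstration of that short-circuiting in Python: when the first operand of `or` is truthy, the second is never evaluated (the expensive_check function here is a made-up stand-in for a costly membership test).

```python
calls = []  # record which arguments actually reached the expensive branch

def expensive_check(x):
    calls.append(x)
    return True

small = {"a"}
hit = "a" in small or expensive_check("a")   # first operand True -> skipped
miss = "q" in small or expensive_check("q")  # first operand False -> runs
```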
In the future, Spark should support a single-column DataFrame in isin (SPARK-23945), and the optimizer should be able to make the decision between broadcast and hash join for you.
Upvotes: 2