Reputation: 247
I have two PySpark dataframes, and I would like to check whether each value in a column of the first dataframe is present in the corresponding column of the second dataframe. If a value from the first dataframe's column is not present in the second dataframe's column, I need to identify those values and write them into a list. Is there a better approach to handle this scenario using PySpark? I appreciate your response.
df.select('Name').show()
Java
Oracle
.NET
df1.select('Name').show()
Oracle
Scala
.NET
python
Upvotes: 0
Views: 106
Reputation: 391
You can use a left anti join, which keeps only the rows of the left dataframe that have no match in the right one.
PySpark -
df_diff = df.join(df1, df['Name'] == df1['Name'], 'left_anti')
SQL
df.createOrReplaceTempView('df')
df1.createOrReplaceTempView('df1')
df_diff = spark.sql('select df.Name from df left anti join df1 on df.Name = df1.Name')
You can also use a left join with a filter on the null matches:
df_diff = df.join(df1, df['Name'] == df1['Name'], 'left').filter(df1['Name'].isNull()).select(df['Name'])
Finally, to fetch the values:
df_diff.show()
All that's left to do to make it a list is to collect the values into a Python list (I'm not sure whether you want the result as a dataframe or as a Python list).
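To make the anti-join semantics concrete, here is a minimal plain-Python sketch of what the join computes, using the sample values from the question (the variable names `df_names`/`df1_names` are illustrative, not part of the PySpark API):

```python
# Sample 'Name' values from the question.
df_names = ["Java", "Oracle", ".NET"]
df1_names = ["Oracle", "Scala", ".NET", "python"]

# A left anti join keeps the rows of df whose Name
# has no match in df1 -- here that is just "Java".
missing = [name for name in df_names if name not in df1_names]
print(missing)  # ['Java']
```

With PySpark itself, the equivalent final step would be collecting the joined dataframe, e.g. `[row['Name'] for row in df_diff.collect()]`.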
Upvotes: 1
Reputation: 136
# Collect the 'Name' values of df1, then keep the values of df
# that are missing from it.
df1_names = [row['Name'] for row in df1.select('Name').collect()]
a = []
for row in df.select('Name').collect():
    if row['Name'] not in df1_names:
        a.append(row['Name'])
Upvotes: 0