Reputation: 247
I have two PySpark dataframes, and I would like to check whether each value in a column of the first dataframe is present in the corresponding column of the second dataframe. If a value from the first dataframe's column is not present in the second dataframe's column, I need to identify those values and write them into a list. Is there a better approach to handle this scenario using PySpark? I appreciate your response.
df.select('Name').show()
Java
Oracle
.NET
df1.select('Name').show()
Oracle
Scala
.NET
python
Upvotes: 0
Views: 106
Reputation: 391
You can use a left anti join, which keeps only the rows of the left dataframe that have no match in the right one.
PySpark -
df_diff = df.join(df1, df['Name'] == df1['Name'], 'left_anti')
SQL
df.createOrReplaceTempView('df')
df1.createOrReplaceTempView('df1')
df_diff = spark.sql('select df.Name from df left anti join df1 on df.Name = df1.Name')
You can also use a left join with a filter on the null matches:
df_diff = df.join(df1, df['Name'] == df1['Name'], 'left').filter(df1['Name'].isNull()).select(df['Name'])
Finally, to fetch the values:
df_diff.show()
All that's left to do to make it a list is to collect the values into a Python list (I'm not sure whether you want the result as a dataframe or as a Python list).
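To make the anti-join semantics concrete, here is a minimal plain-Python sketch of what the join computes, using the sample values from the question (the variable names `df_names`/`df1_names` are illustrative, not part of the PySpark API):

```python
# Sample 'Name' values from the question.
df_names = ["Java", "Oracle", ".NET"]
df1_names = ["Oracle", "Scala", ".NET", "python"]

# A left anti join keeps the rows of df whose Name
# has no match in df1 -- here that is just "Java".
missing = [name for name in df_names if name not in df1_names]
print(missing)  # ['Java']
```

With PySpark itself, the equivalent final step would be collecting the joined dataframe, e.g. `[row['Name'] for row in df_diff.collect()]`.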
Upvotes: 1
Reputation: 136
# Collect the 'Name' values of df1, then keep the values of df
# that are missing from it.
df1_names = [row['Name'] for row in df1.select('Name').collect()]
a = []
for row in df.select('Name').collect():
    if row['Name'] not in df1_names:
        a.append(row['Name'])
Upvotes: 0