N9909

Reputation: 247

How to check the column values from different dataframes?

I have two PySpark dataframes, and I would like to check whether each value in a column of the first dataframe is present in a column of the second dataframe. If a value from the first dataframe's column is not present in the second dataframe's column, I need to identify those values and write them into a list. Is there a better approach to handle this scenario using PySpark? I appreciate your response.

df.select('Name').show()
Java
Oracle
.NET

df1.select('Name').show()
Oracle
Scala
.NET
python

Upvotes: 0

Views: 106

Answers (2)

Assaf Segev

Reputation: 391

You can use a left anti join.

PySpark -

df_diff = df.join(df1, df['Name'] == df1['Name'], 'left_anti')

SQL

df.createOrReplaceTempView('df')
df1.createOrReplaceTempView('df1')
df_diff = spark.sql('select df.Name from df left anti join df1 on df.Name = df1.Name')

You can also use a left join with a filter on the right side being null:

df_diff = df.join(df1, df['Name'] == df1['Name'], 'left').filter(df1['Name'].isNull())

Finally, to fetch the values:

df_diff.show()

All that's left to make it a list is to collect the values into a Python list (I'm not sure if you want the result as a dataframe or a Python list).
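With the question's sample data, the set-difference logic that the anti join implements can be sketched in plain Python (a minimal sketch assuming the column values have already been collected to the driver, e.g. with `[row['Name'] for row in df_diff.collect()]`):

```python
# Values from the question's two dataframes (already on the driver).
df_names = ['Java', 'Oracle', '.NET']
df1_names = ['Oracle', 'Scala', '.NET', 'python']

# A left anti join keeps rows of the left side that have no match on the
# right; on collected values that is a simple membership test.
present = set(df1_names)
missing = [name for name in df_names if name not in present]

print(missing)  # ['Java']
```

This confirms what the anti join should return here: only 'Java' is in the first dataframe but not in the second.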

Upvotes: 1

Joshua

Reputation: 136

a = []
# Collect the column *values* (not the column names) from each dataframe,
# then keep the values of df that do not appear in df1.
df_values = [row['Name'] for row in df.select('Name').collect()]
df1_values = {row['Name'] for row in df1.select('Name').collect()}
for value in df_values:
    if value not in df1_values:
        a.append(value)

Upvotes: 0
