Compare Spark Dataframes Where Not Equal With List of Comparison Columns

Question

I'm trying to compare two data frames in PySpark to find the rows where certain fields don't match. I have been able to write the comparison manually, but I want to pass a list of fields and have the join keep the rows where the frames do not match on those fields. The data frames are identical in structure.

The code I have thus far is:

key_cols = ['team_link_uuid', 'team_sat_hash']
temp_team_sat = orig.select(*key_cols)
temp_team_sat_incremental = delta.select(*key_cols)
hash_field = ['team_sat_hash']

test_update_list = temp_team_sat.join(temp_team_sat_incremental, (temp_team_sat.team_link_uuid == temp_team_sat_incremental.team_link_uuid) & (temp_team_sat.team_sat_hash != temp_team_sat_incremental.team_sat_hash))

Now I need to take my list (hash_field) and build the inequality condition from it, whether it holds one field or many.

Steven · Accepted Answer

Assuming fields_to_compare_list is a list of the fields you want to compare (hash_field in your case), you can build the inequality condition with reduce:

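from functools import reduce

# Build one "!=" condition per column and OR them together, so a row
# matches when at least one of the compared columns differs.
comparaison_query = reduce(
    lambda a, b: a | b,
    [
        temp_team_sat[c] != temp_team_sat_incremental[c]
        for c in fields_to_compare_list
    ],
)

# Join on the key, keeping only rows where some compared column differs.
test_update_list = temp_team_sat.join(
    temp_team_sat_incremental,
    on=(temp_team_sat.team_link_uuid == temp_team_sat_incremental.team_link_uuid)
    & comparaison_query,
)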
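
If you need this in more than one place, the same idea can be wrapped in a pair of small helpers. This is only a sketch: build_mismatch_condition and find_changed_rows are names I made up for illustration, not part of PySpark, and the helpers assume both frames expose the key column and every compared column under the same names.

```python
from functools import reduce
from operator import or_


def build_mismatch_condition(left, right, columns):
    """OR together `left[c] != right[c]` for every name in `columns`.

    Hypothetical helper: `left` and `right` are expected to be DataFrames,
    but anything that supports `[]`, `!=` and `|` will work.
    """
    if not columns:
        # reduce() on an empty list raises a less helpful TypeError
        raise ValueError("columns must contain at least one column name")
    return reduce(or_, [left[c] != right[c] for c in columns])


def find_changed_rows(left, right, key_col, compare_cols):
    """Inner-join on `key_col`, keeping rows where any compared column differs."""
    condition = (left[key_col] == right[key_col]) & build_mismatch_condition(
        left, right, compare_cols
    )
    return left.join(right, on=condition)


# Usage with the names from the question (not executed here):
# test_update_list = find_changed_rows(
#     temp_team_sat, temp_team_sat_incremental, "team_link_uuid", hash_field
# )
```

One caveat: in Spark, != evaluates to null when either side is null, so rows where a compared field is null on one side are dropped by the join. If that matters for your data, ~left[c].eqNullSafe(right[c]) is a null-safe alternative to left[c] != right[c].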
