Suraj Pandey
Suraj Pandey

Reputation: 107

Pyspark find columns with mismatched data

I have a Pyspark dataframe df1. It has columns like col1, col2, col3, col4, col5.

+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A   |A   |X   |Y   |Y   |
|B   |C   |Y   |Y   |Z   |
|D   |D   |Z   |X   |Z   |
|E   |E   |M   |M   |A   |
+----+----+----+----+----+

Have created a list of tuple storing column pairs to compare

t1 = [('col1', 'col2'), ('col3', 'col5')]

Now I want to create new column mismatched_data which should maintain the col name and its original and new value based on the tuple -

+----+----+----+----+----+------------------------------------------------------+
|col1|col2|col3|col4|col5|mismatched_data                                       |
+----+----+----+----+----+------------------------------------------------------+
|A   |A   |X   |Y   |Y   |[{"col3": {"original_value": "X", "new_value": "Y"}}] |
|B   |C   |Y   |Y   |Z   |[{"col1": {"original_value": "B", "new_value": "C"}}, {"col3":  
{"original_value": "Y", "new_value": "Z"}}] |
|D   |D   |Z   |X   |Z   |NULL                                                  |
|E   |E   |M   |M   |A   |[{"col3": {"original_value": "M", "new_value": "A"}}] |
+----+----+----+----+----+------------------------------------------------------+

Please let me know how I can achieve this in an efficient manner.

Upvotes: 0

Views: 57

Answers (1)

Suraj Pandey
Suraj Pandey

Reputation: 107

t1 = [('col1', 'col2'), ('col3', 'col5')]
mismatch_expressions = []
for col1, col2 in t1:
    mismatch_expr = F.when(F.col(col1) != F.col(col2), F.create_map(F.lit(col1), F.create_map(F.lit("original_value"), F.col(col1), F.lit("new_value"), F.col(col2))))
    mismatch_expressions.append(mismatch_expr)

df1 = df1.withColumn("mismatched_data", F.array(*mismatch_expressions))
df_filtered = df1.withColumn("mismatched_data", F.expr("filter(mismatched_data, x -> x is not null)"))
df_filtered.show(truncate=False)

Above logic gives below expected output -

+----+----+----+----+----+--------------------------------------------------------------------------------------------------+
|col1|col2|col3|col4|col5|mismatched_data                                                                                   |
+----+----+----+----+----+--------------------------------------------------------------------------------------------------+
|A   |A   |X   |Y   |Y   |[{col3 -> {original_value -> X, new_value -> Y}}]                                                 |
|B   |C   |Y   |Y   |Z   |[{col1 -> {original_value -> B, new_value -> C}}, {col3 -> {original_value -> Y, new_value -> Z}}]|
|D   |D   |Z   |X   |Z   |[]                                                                                                |
|E   |E   |M   |M   |A   |[{col3 -> {original_value -> M, new_value -> A}}]                                                 |
+----+----+----+----+----+--------------------------------------------------------------------------------------------------+

Upvotes: 0

Related Questions