Reputation: 107
I have a Pyspark dataframe df1. It has columns like col1, col2, col3, col4, col5.
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|A |A |X |Y |Y |
|B |C |Y |Y |Z |
|D |D |Z |X |Z |
|E |E |M |M |A |
+----+----+----+----+----+
Have created a list of tuple storing column pairs to compare
t1 = [('col1', 'col2'), ('col3', 'col5')]
Now I want to create new column mismatched_data which should maintain the col name and its original and new value based on the tuple -
+----+----+----+----+----+------------------------------------------------------+
|col1|col2|col3|col4|col5|mismatched_data |
+----+----+----+----+----+------------------------------------------------------+
|A |A |X |Y |Y |[{"col3": {"original_value": "X", "new_value": "Y"}}] |
|B |C |Y |Y |Z |[{"col1": {"original_value": "B", "new_value": "C"}}, {"col3":
{"original_value": "Y", "new_value": "Z"}}] |
|D |D |Z |X |Z |NULL |
|E |E |M |M |A |[{"col3": {"original_value": "M", "new_value": "A"}}] |
+----+----+----+----+----+------------------------------------------------------+
Please let me know how I can achieve this in an efficient manner.
Upvotes: 0
Views: 57
Reputation: 107
t1 = [('col1', 'col2'), ('col3', 'col5')]
mismatch_expressions = []
for col1, col2 in t1:
mismatch_expr = F.when(F.col(col1) != F.col(col2), F.create_map(F.lit(col1), F.create_map(F.lit("original_value"), F.col(col1), F.lit("new_value"), F.col(col2))))
mismatch_expressions.append(mismatch_expr)
df1 = df1.withColumn("mismatched_data", F.array(*mismatch_expressions))
df_filtered = df1.withColumn("mismatched_data", F.expr("filter(mismatched_data, x -> x is not null)"))
df_filtered.show(truncate=False)
Above logic gives below expected output -
+----+----+----+----+----+--------------------------------------------------------------------------------------------------+
|col1|col2|col3|col4|col5|mismatched_data |
+----+----+----+----+----+--------------------------------------------------------------------------------------------------+
|A |A |X |Y |Y |[{col3 -> {original_value -> X, new_value -> Y}}] |
|B |C |Y |Y |Z |[{col1 -> {original_value -> B, new_value -> C}}, {col3 -> {original_value -> Y, new_value -> Z}}]|
|D |D |Z |X |Z |[] |
|E |E |M |M |A |[{col3 -> {original_value -> M, new_value -> A}}] |
+----+----+----+----+----+--------------------------------------------------------------------------------------------------+
Upvotes: 0