samba

Reputation: 3111

Pyspark - how to drop records by primary keys?

I want to delete records from an old_df whenever new_df carries a del flag for the same primary key, metric_id. What is the right way to achieve this?

old_df (flag here is filled with nulls on purpose)

+---------+--------+-------------+
|metric_id| flag   |        value|
+---------+--------+-------------+
|       10|    null|       value2|
|       10|    null|       value9|
|       12|    null|updated_value|
|       15|    null|  test_value2|
+---------+--------+-------------+

new_df

+---------+--------+-------------+
|metric_id| flag   |        value|
+---------+--------+-------------+
|       10|     del|       value2|
|       12|    pass|updated_value|
|       15|     del|  test_value2|
+---------+--------+-------------+

result_df

+---------+--------+-------------+
|metric_id| flag   |        value|
+---------+--------+-------------+
|       12|    pass|updated_value|
+---------+--------+-------------+
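For reference, the two input frames above can be built like this (a minimal sketch, assuming an active SparkSession named spark; the explicit schema string is my own choice so the all-null flag column gets a type):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# old_df: flag is intentionally all null, so give it an explicit string type
old_df = spark.createDataFrame(
    [(10, None, 'value2'),
     (10, None, 'value9'),
     (12, None, 'updated_value'),
     (15, None, 'test_value2')],
    schema='metric_id int, flag string, value string',
)

new_df = spark.createDataFrame(
    [(10, 'del', 'value2'),
     (12, 'pass', 'updated_value'),
     (15, 'del', 'test_value2')],
    schema='metric_id int, flag string, value string',
)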

Upvotes: 0

Views: 231

Answers (1)

ernest_k

Reputation: 45329

One easy way to do this is to join then filter:

from pyspark.sql.functions import lit

result_df = (
    old_df.join(new_df, on='metric_id', how='left')
          # keep rows whose key is absent from new_df (flag is null after the
          # left join) or whose flag is anything other than 'del'
          .where((new_df['flag'].isNull()) | (new_df['flag'] != lit('del')))
          .select('metric_id', new_df['flag'], new_df['value'])
)

Which produces

+---------+----+-------------+
|metric_id|flag|        value|
+---------+----+-------------+
|       12|pass|updated_value|
+---------+----+-------------+

I'm using a left join because there might be records in old_df for which the primary key is not present in new_df (and you don't want to delete those).
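If you also want those kept rows to retain old_df's own flag and value (the left join above would bring in nulls from new_df for them), one alternative sketch, under the same assumptions, is an anti-join against just the keys flagged del:

from pyspark.sql import functions as F

# keys explicitly marked for deletion in new_df
del_keys = new_df.where(F.col('flag') == 'del').select('metric_id')

# drop those keys from old_df; every other row keeps its original columns
result_df = old_df.join(del_keys, on='metric_id', how='left_anti')

Note that this version keeps old_df's columns, so for metric_id 12 the flag stays null rather than showing new_df's pass value.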

Upvotes: 1
