Reputation: 121
I have a Delta Lake table in Azure, and I'm using Databricks. When we add new entries we use MERGE INTO to prevent duplicates from getting into the table. However, duplicates did get into the table anyway. I'm not sure how it happened; maybe the MERGE INTO conditions weren't set up properly.
However it happened, the duplicates are there now. Is there any way to detect and remove duplicates from the table? All the documentation I've found shows how to deduplicate the dataset before merging; nothing covers the case where the duplicates are already in the table. How can I remove them?
Thanks
Upvotes: 2
Views: 5068
Reputation: 358
I would suggest the following SOP:
Upvotes: 0
Reputation: 3008
To remove the duplicates you can follow the approach below:
Once you follow the above steps your table will no longer have duplicate rows, but this is just a workaround to make your table consistent, not a permanent solution.
Before or after you follow the above steps, you will have to look at your MERGE INTO statement to check that it is written correctly and does not insert duplicate records. If the MERGE INTO statement is correct, make sure that the dataset you are processing does not already contain duplicate records at the source you are reading from.
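The cleanup and the merge check described above can be sketched in Databricks SQL. The table name, staging source, and key column below are assumptions for illustration, not from the question:

```sql
-- Workaround: rewrite the table keeping only fully distinct rows.
-- On a Delta table this self-referencing overwrite reads the snapshot
-- taken before the write begins, so it is safe.
INSERT OVERWRITE my_table
SELECT DISTINCT * FROM my_table;

-- Prevention: make the ON clause cover the full business key so that
-- matching rows are updated instead of being inserted a second time.
MERGE INTO my_table AS t
USING staged_updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Note that SELECT DISTINCT only removes rows that are identical in every column; rows that share a key but differ in other columns need a key-based deduplication instead.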
Upvotes: 0
Reputation: 1
You can use dataset.dropDuplicates() to remove duplicates based on one or more columns.
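dropDuplicates operates on a DataFrame, so to fix the table you still have to write the result back. A key-based equivalent in Databricks SQL (the table name and the columns id and updated_at are assumptions, not from the question) uses a window function:

```sql
-- Keep one row per id, similar in spirit to df.dropDuplicates(["id"]).
-- The ORDER BY inside the window decides which duplicate survives.
INSERT OVERWRITE my_table
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
         row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM my_table
)
WHERE rn = 1;
```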
Upvotes: 0
Reputation: 210
If the duplicates already exist in the target table, your only options are:
Upvotes: 0