Reputation: 43
We are using the JDBC option to insert transformed data from a Spark DataFrame into a table in Azure SQL Server. Below is the code snippet we are using for this insert. However, we have noticed on a few occasions that some records are duplicated in the destination table. This happens with large tables; for example, if a DataFrame has 600K records, the table ends up with around 620K records after the insert. It is a very rare occurrence, but we still want to understand why it happens.
DataToLoad.write.jdbc(url = jdbcUrl, table = targetTable, mode = "overwrite", properties = jdbcConnectionProperties)
The only explanation we could think of is that, since the inserts happen in a distributed fashion, a failed executor's tasks are retried, and the retried tasks could insert duplicate records. This may be completely off the mark, but we want to check whether it could be the issue.
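For what it's worth, below is a rough post-load check we are considering to catch this when it happens. It simply reads the row count back from the destination table and compares it with the DataFrame count; it assumes the same spark session and variables as the snippet above.

# Sketch of a post-load sanity check: compare the number of rows in the DataFrame
# with the number of rows that actually landed in the destination table.
expected = DataToLoad.count()

written = spark.read.jdbc(
    url=jdbcUrl,
    table=targetTable,
    properties=jdbcConnectionProperties,
).count()

if written != expected:
    raise RuntimeError(
        f"Row count mismatch after JDBC load: expected {expected}, table has {written}"
    )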
Upvotes: 0
Views: 1063
Reputation: 5
I faced the same issue for some records and fixed it by coalescing the DataFrame to a single partition before writing to SQL Server.
You can use the below code for reference:
DataToLoad.coalesce(1).write.jdbc(url = jdbcUrl, table = targetTable, mode = "overwrite", properties = jdbcConnectionProperties)
The reason duplicate records were being inserted into SQL Server is still unclear, but I suspect it was due to cluster configuration issues.
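If writing through a single partition turns out to be too slow for larger tables, an alternative I have not fully verified is to cap the write parallelism with the JDBC data source's numPartitions option instead of coalesce(1). A rough sketch, assuming jdbcConnectionProperties is a plain dict of string options (the values 4 and 10000 are only examples):

# Sketch only: keep some write parallelism, but cap the number of concurrent
# JDBC write tasks; Spark coalesces the DataFrame down to numPartitions before writing.
(
    DataToLoad.write.format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", targetTable)
    .option("numPartitions", "4")          # at most 4 concurrent write connections
    .option("batchsize", "10000")          # rows per insert round trip (default 1000)
    .options(**jdbcConnectionProperties)   # assumes user/password/driver are in this dict
    .mode("overwrite")
    .save()
)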
Upvotes: 1