Reputation: 97
In Azure Data Factory, how do I copy data from Blob storage to SQL without duplication? That is, if the pipeline runs on a slice every 15 minutes, how do I avoid loading the same data more than once?
Upvotes: 2
Views: 4770
Reputation: 1
I had the same problem and found this link helpful: https://www.mssqltips.com/sqlservertip/6365/incremental-file-load-using-azure-data-factory/
In our case, we only add files to blob storage and never modify them afterwards, so the job is simply to pick up files created within the last 15 minutes and load them into the SQL table. The incremental copy procedure described in the link has worked well so far.
I can imagine that in some cases you may also need a stored procedure that acts on the SQL table after the load, but we did not need one.
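For illustration, here is a minimal sketch of what such a post-load stored procedure could look like, assuming the copy lands rows in a staging table first; the table and column names (dbo.StagingEvents, dbo.Events, EventId) are hypothetical, not from the tip linked above.

    -- Hypothetical cleanup step: move only rows not already present
    -- from staging into the target, so re-running a slice cannot
    -- introduce duplicates.
    CREATE PROCEDURE dbo.sp_MergeStagedEvents
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.Events (EventId, Payload, CreatedAt)
        SELECT s.EventId, s.Payload, s.CreatedAt
        FROM dbo.StagingEvents AS s
        WHERE NOT EXISTS (SELECT 1 FROM dbo.Events AS e
                          WHERE e.EventId = s.EventId);

        TRUNCATE TABLE dbo.StagingEvents;  -- staging is reloaded each slice
    END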
Upvotes: 0
Reputation: 1
I had the same issue. You can pass the slice start time and slice end time to your stored procedure and filter your queries on them like any other parameter. That way each run loads only the data belonging to its own slice, instead of loading the same data once per slice. I hope that's clear enough.
"typeProperties": {
"storedProcedureName": "sp_sample",
"storedProcedureParameters": {
"DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}', SliceStart)"
}
}
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-stored-proc-activity
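To show what this looks like on the SQL side, here is a sketch of a procedure that filters on the slice window. The JSON above passes only SliceStart; a second parameter bound to SliceEnd could be added the same way. The table and column names (dbo.Staging, dbo.Target, LastModified) are assumptions for the example.

    -- Hypothetical body for sp_sample: load only rows whose timestamp
    -- falls inside the slice window passed in from Data Factory, so
    -- each 15-minute run picks up its own data and nothing else.
    CREATE PROCEDURE dbo.sp_sample
        @SliceStart DATETIME,
        @SliceEnd   DATETIME
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.Target (Id, Payload, LastModified)
        SELECT s.Id, s.Payload, s.LastModified
        FROM dbo.Staging AS s
        WHERE s.LastModified >= @SliceStart
          AND s.LastModified <  @SliceEnd;  -- half-open window: slices never overlap
    END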
Upvotes: 0
Reputation: 131
The solution isn't automatic, but you can use a Copy activity with a stored procedure in the SQL sink to handle rows that may already exist, for example a T-SQL MERGE statement, or an INSERT/UPDATE pair inside the procedure.
https://azure.microsoft.com/en-us/documentation/articles/data-factory-copy-activity/
Invoke stored procedure for SQL sink: when copying data into SQL Server or Azure SQL Database, a user-specified stored procedure can be configured and invoked.
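As a rough sketch of the MERGE approach, assuming the copied rows land in a staging table and the target is keyed by Id (all object names here are illustrative, not from the documentation):

    -- Hypothetical upsert procedure for the Copy activity's SQL sink:
    -- update rows that already exist, insert the ones that don't.
    CREATE PROCEDURE dbo.sp_UpsertRows
    AS
    BEGIN
        SET NOCOUNT ON;

        MERGE dbo.Target AS t
        USING dbo.StagingRows AS s
            ON t.Id = s.Id
        WHEN MATCHED THEN
            UPDATE SET t.Payload = s.Payload,
                       t.LastModified = s.LastModified
        WHEN NOT MATCHED THEN
            INSERT (Id, Payload, LastModified)
            VALUES (s.Id, s.Payload, s.LastModified);

        TRUNCATE TABLE dbo.StagingRows;  -- clear staging for the next slice
    END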
Thanks, Jason
Upvotes: 1