Reputation: 171
I am currently setting up an Azure SQL Data Warehouse and using Databricks for the ETL process, with JSON files from Azure Blob Storage.
What is the best practice to make sure I do not import duplicate dimensions or facts into the Azure SQL Data Warehouse?
For facts this could happen, for example, if an exception interrupts the loading process; for dimensions it could happen as well if I did not check which data already exists. I am using the following code to import data into the data warehouse, and I found no "mode" that would import only data that does not already exist:
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

renamedColumnsDf.write
  .format("com.databricks.spark.sqldw")
  .option("url", sqlDwUrlSmall)
  .option("dbtable", "SampleTable")
  .option("forward_spark_azure_storage_credentials", "True")
  .option("tempdir", tempDir)
  .mode("overwrite")
  .save()
Upvotes: 1
Views: 765
Reputation: 3078
Ingest into a staging table, then CTAS (CREATE TABLE AS SELECT) into your fact table with a NOT EXISTS clause to eliminate duplicates, as sketched below.
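A minimal T-SQL sketch of one way to apply that pattern, with hypothetical names: the Spark job writes to a staging table stg.SampleTable, the target fact table is dbo.FactSample, and SampleKey is the business key used to detect duplicates.

-- Rebuild the fact table as: all existing rows, plus only those staged
-- rows whose business key is not already present (NOT EXISTS).
-- Table and column names are placeholders; both SELECTs must return
-- the same columns in the same order.
CREATE TABLE dbo.FactSample_New
WITH
(
    DISTRIBUTION = HASH(SampleKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT f.*
FROM dbo.FactSample AS f
UNION ALL
SELECT s.*
FROM stg.SampleTable AS s
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.FactSample AS f2
    WHERE f2.SampleKey = s.SampleKey
);

-- Swap the rebuilt table in and drop the old one.
RENAME OBJECT dbo.FactSample TO FactSample_Old;
RENAME OBJECT dbo.FactSample_New TO FactSample;
DROP TABLE dbo.FactSample_Old;

The CTAS-and-rename approach stands in for MERGE, which Azure SQL Data Warehouse does not support; the same NOT EXISTS filter works for dimension loads against the dimension's business key.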
Upvotes: 1