Thomas Hahn
Thomas Hahn

Reputation: 171

Strategies to prevent duplicate data in Azure SQL Data Warehouse

At the moment I am setting up an Azure SQL Data Warehouse. I am using Databricks for the ETL process with JSON-files from Azure Blob Storage.

What is the best practice to make sure to not import duplicate dimensions or facts into the Azure SQL Data Warehouse?

This could happen for facts e.g. in the case of en exception while the loading process. And for dimensions this could happen as well if I would not check, which data already exists. I am using the following code to import data into the data warehouse and I found no "mode" which would only import data which not already exists:

spark.conf.set(
   "spark.sql.parquet.writeLegacyFormat",
   "true")

 renamedColumnsDf.write
     .format("com.databricks.spark.sqldw")
     .option("url", sqlDwUrlSmall)
     .option("dbtable", "SampleTable")
     .option( "forward_spark_azure_storage_credentials","True")
     .option("tempdir", tempDir)
     .mode("overwrite")
     .save()

Upvotes: 1

Views: 765

Answers (1)

Ron Dunn
Ron Dunn

Reputation: 3078

Ingest to a staging table, then CTAS to your fact table with a NOT EXISTS clause to eliminate duplicates.

Upvotes: 1

Related Questions