Tuomas Tikka

Reputation: 203

Duplicate Blob Created Events When Writing to Azure Blob Storage from Azure Databricks

We are using an Azure Storage Account (Blob, StorageV2) with a single container in it. We are also using Azure Data Factory to trigger data copy pipelines from blobs (.tar.gz) created in the container. The trigger works fine when the blobs are created from an Azure App Service or uploaded manually via Azure Storage Explorer. But when the blob is created from a notebook on Azure Databricks, we get two (2) events for every blob created (same parameters for both events). The code for creating the blob from the notebook resembles:

dbutils.fs.cp(
  "/mnt/data/tmp/file.tar.gz", 
  "/mnt/data/out/file.tar.gz"
)

The tmp folder is only used to assemble the package; the event trigger is attached to the out folder. We also tried dbutils.fs.mv, but got the same result. The trigger rules in Azure Data Factory are:

Blob path begins with: out/

Blob path ends with: .tar.gz

The container name is data.

We did find some similar posts about zero-length files, but we can't see any in the container (in case they are some kind of by-product of dbutils).
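One way we checked for them (using the mount path from the question) was to list the trigger folder and filter on size:

# List the out folder and keep only zero-byte entries
zero_length = [f for f in dbutils.fs.ls("/mnt/data/out/") if f.size == 0]
print(zero_length)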

As mentioned, just manually uploading file.tar.gz works fine - a single event is triggered.

Upvotes: 1

Views: 982

Answers (1)

Tuomas Tikka

Reputation: 203

We had to revert to uploading the files from Databricks to Blob Storage using the azure-storage-blob library. Kind of a bummer, but it now works as expected. Posting this in case anyone else runs into the same issue.

More information:

https://learn.microsoft.com/en-gb/azure/storage/blobs/storage-quickstart-blobs-python
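For reference, a minimal sketch of the upload, roughly following that quickstart. The container (data) and file paths are the ones from the question; the secret scope and key names are placeholders for wherever you keep the storage connection string:

from azure.storage.blob import BlobServiceClient

# Connection string kept in a Databricks secret scope ("storage"/"connection-string" are placeholder names)
connection_string = dbutils.secrets.get(scope="storage", key="connection-string")

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client(container="data", blob="out/file.tar.gz")

# Read the assembled package via the local /dbfs mount path and upload it in a single call
with open("/dbfs/mnt/data/tmp/file.tar.gz", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)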

Upvotes: 1
