JonJagd

Reputation: 65

Optimize Azure Data Factory copy of 10,000+ JSON files from Blob storage to ADLS Gen2

Situation: Every day a batch of JSON files is generated and put into Azure Blob storage. Also every day, an Azure Data Factory copy job does a lookup in the Blob storage and applies a "Filter by last modified":

Start time: @adddays(utcnow(),-2)
End time: @utcnow()

The files are copied to Azure Data Lake Storage Gen2.
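If the last-modified filter is applied directly on the Copy activity source, it corresponds roughly to the following source settings (a minimal sketch; the JsonSource type and wildcard setting are assumptions about the setup, not taken from the question):

    "source": {
        "type": "JsonSource",
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true,
            "wildcardFileName": "*.json",
            "modifiedDatetimeStart": {
                "value": "@adddays(utcnow(),-2)",
                "type": "Expression"
            },
            "modifiedDatetimeEnd": {
                "value": "@utcnow()",
                "type": "Expression"
            }
        }
    }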

On normal days with 50-100 new JSON files the copy job runs fine, but on the last day of every quarter the number of new JSON files increases to 10,000+ and the copy job fails with the message "ErrorCode=SystemErrorFailToInsertSubJobForTooLargePayload,…".

Therefore I have made a new copy pipeline that uses a ForEach loop to run parallel copy activities. This can handle much larger volumes of files, but it still takes a couple of minutes per file, and I have not seen more than around 500 files per hour being copied, so that is still not fast enough.
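As a sketch, the ForEach setup looks roughly like this (the GetNewFiles activity name and the inner Copy stub are illustrative; batchCount caps the number of parallel iterations at 50):

    {
        "name": "CopyEachFile",
        "type": "ForEach",
        "typeProperties": {
            "isSequential": false,
            "batchCount": 50,
            "items": {
                "value": "@activity('GetNewFiles').output.childItems",
                "type": "Expression"
            },
            "activities": [
                {
                    "name": "CopySingleFile",
                    "type": "Copy"
                }
            ]
        }
    }

Even with parallel iterations, each per-file Copy activity carries its own startup overhead, which matches the couple-of-minutes-per-file behaviour described above.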

Therefore I am searching for more ways to optimize the copy. I have attached a couple of screenshots of the setup but can give more details on specifics.

Upvotes: 1

Views: 575

Answers (1)

Utkarsh Pal

Reputation: 4544

The issue is with the size of the payload, which cannot be processed with the current configuration (assuming you are using the default settings).

You can optimize Copy activity performance by making the following changes in your Azure Data Factory (ADF) environment:

  • Data Integration Units
  • Self-hosted integration runtime scalability
  • Parallel copy
  • Staged copy

You can try these Performance Tuning Steps in your ADF to improve the performance.

Configure the copy optimization features in the Settings tab of the Copy activity.
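As a hedged example, the Data Integration Units, parallel copy, and staged copy options map to the following Copy activity typeProperties (the values shown are illustrative, not recommendations, and the staging linked service name is an assumption):

    "typeProperties": {
        "source": { "type": "JsonSource" },
        "sink": { "type": "JsonSink" },
        "dataIntegrationUnits": 32,
        "parallelCopies": 32,
        "enableStaging": true,
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "StagingBlobStorage",
                "type": "LinkedServiceReference"
            }
        }
    }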


Refer to Copy activity performance optimization features for more details and a better understanding.

Upvotes: 1
