JonJagd

Reputation: 65

Optimize Azure Data Factory copy of 10,000+ JSON files from Blob storage to ADLS Gen2

Situation: Every day a batch of JSON files is generated and put into Azure Blob storage. Also every day, an Azure Data Factory copy job does a lookup in the Blob storage and applies a "Filter by last modified":

Start time: @adddays(utcnow(),-2)
End time: @utcnow()

The files are copied to Azure Data Lake Storage Gen2.
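If the last-modified filter is applied directly on the Copy activity source, it corresponds roughly to the following source settings (a minimal sketch; the JsonSource type and wildcard setting are assumptions about the setup, not taken from the question):

    "source": {
        "type": "JsonSource",
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true,
            "wildcardFileName": "*.json",
            "modifiedDatetimeStart": {
                "value": "@adddays(utcnow(),-2)",
                "type": "Expression"
            },
            "modifiedDatetimeEnd": {
                "value": "@utcnow()",
                "type": "Expression"
            }
        }
    }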

On normal days with 50-100 new JSON files the copy job runs fine, but on the last day of every quarter the number of new JSON files increases to 10,000+ and the copy job fails with the message "ErrorCode=SystemErrorFailToInsertSubJobForTooLargePayload,…".

Therefore I have made a new copy pipeline that uses a ForEach loop to run parallel copy activities. This can handle much larger volumes of files, but it still takes a couple of minutes per file, and I have not seen more than around 500 files per hour being copied, so that is still not fast enough.
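As a sketch, the ForEach setup looks roughly like this (the GetNewFiles activity name and the inner Copy stub are illustrative; batchCount caps the number of parallel iterations at 50):

    {
        "name": "CopyEachFile",
        "type": "ForEach",
        "typeProperties": {
            "isSequential": false,
            "batchCount": 50,
            "items": {
                "value": "@activity('GetNewFiles').output.childItems",
                "type": "Expression"
            },
            "activities": [
                {
                    "name": "CopySingleFile",
                    "type": "Copy"
                }
            ]
        }
    }

Even with parallel iterations, each per-file Copy activity carries its own startup overhead, which matches the couple-of-minutes-per-file behaviour described above.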

Therefore I am searching for more ways to optimize the copy. I have attached a couple of screenshots of the setup but can give more details on specifics.

Upvotes: 1

Views: 575

Answers (1)

Utkarsh Pal

Reputation: 4544

The issue is with the size of the payload, which cannot be processed with the current configuration (assuming you are using the default settings).

You can optimize Copy activity performance by making the following changes in your Azure Data Factory (ADF) environment:

  • Data Integration Units
  • Self-hosted integration runtime scalability
  • Parallel copy
  • Staged copy

You can try these Performance Tuning Steps in your ADF to improve the performance.

Configure the copy optimization features in the Settings tab of the Copy activity.
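As a hedged example, the Data Integration Units, parallel copy, and staged copy options map to the following Copy activity typeProperties (the values shown are illustrative, not recommendations, and the staging linked service name is an assumption):

    "typeProperties": {
        "source": { "type": "JsonSource" },
        "sink": { "type": "JsonSink" },
        "dataIntegrationUnits": 32,
        "parallelCopies": 32,
        "enableStaging": true,
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "StagingBlobStorage",
                "type": "LinkedServiceReference"
            }
        }
    }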


Refer to Copy activity performance optimization features for more details and a better understanding.

Upvotes: 1
