Reputation: 1
We are getting large JSON files from an API and we would like to load them into CosmosDB/MongoDB. There is a size limit for documents in CosmosDB/MongoDB.
We are trying to partition the file using an ADF data flow. If we set a certain number of partitions, we always get 2 files, one of which is 0 bytes. We also tried partitioning by key, but ADF does not recognize the key and throws a "sink requires partition column" error.
We would appreciate any help/guidance with this.
If you want to play with it, below is a link to a dummy file: https://easyupload.io/hxg67n
Upvotes: 0
Views: 1235
Reputation: 3240
I would recommend using the round-robin partition technique.
Round robin distributes data equally across partitions. Use round-robin when you don't have good key candidates to implement a solid, smart partitioning strategy. You can set the number of physical partitions.
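Conceptually, round-robin partitioning just deals records out across N buckets in turn, which is why every partition ends up roughly the same size and you avoid the empty second file you are seeing. Here is a minimal Python sketch of the idea outside ADF (the file names and the assumption that the input is a JSON array are mine, purely for illustration):

```python
import json

def round_robin_partition(records, num_partitions):
    """Deal records out across num_partitions buckets in turn,
    so each bucket ends up with roughly the same number of records."""
    buckets = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        buckets[i % num_partitions].append(record)
    return buckets

# Hypothetical usage: split one large JSON array into 10 smaller files.
with open("large_input.json") as f:
    records = json.load(f)  # assumes the file is a JSON array of documents

for idx, bucket in enumerate(round_robin_partition(records, 10)):
    with open(f"partition_{idx}.json", "w") as out:
        json.dump(bucket, out)
```

In ADF itself you get the same effect by choosing "Round robin" under the Optimize tab of the transformation and setting the number of partitions there.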
Note
You need to evaluate the data size or the partition count of the input data, then set a reasonable partition number under "Optimize". For example, if the cluster used for the data flow pipeline execution has 8 cores with 20 GB of memory per core, but the input data is 1000 GB split into 10 partitions, running the data flow directly will hit an out-of-memory (OOM) issue because 1000 GB / 10 > 20 GB. It is better to set the partition number to 100, so that 1000 GB / 100 < 20 GB.
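That sizing rule is simple back-of-the-envelope arithmetic, so you can script it. A rough sketch below; the 20 GB-per-core figure comes from the example above, and the safety factor is my assumption rather than anything ADF exposes:

```python
import math

def recommended_partitions(data_size_gb, memory_per_core_gb, safety_factor=0.5):
    """Smallest partition count that keeps each partition comfortably
    below the memory available to a single core."""
    max_partition_size_gb = memory_per_core_gb * safety_factor
    return math.ceil(data_size_gb / max_partition_size_gb)

# The example from the note: 1000 GB of input, 20 GB of memory per core.
print(recommended_partitions(1000, 20))  # 100 partitions -> 10 GB each
```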
Refer to: Supported sink types – Sink transformation in mapping data flow - Azure Data Factory & Azure Synapse | Microsoft Docs
Upvotes: 1