Simon Zeinstra

Reputation: 815

Loading millions of small files from Azure Data Lake Store to Databricks

I've got a partitioned folder structure in Azure Data Lake Store containing roughly 6 million JSON files (a couple of KB to 2 MB each). I'm trying to extract some fields from these files using Python code in Databricks.

Currently I'm trying the following:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx")
spark.conf.set("dfs.adls.oauth2.credential", "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx/oauth2/token")

df = spark.read.json("adl://xxxxxxx.azuredatalakestore.net/staging/filetype/category/2017/*/")

Note that this example only reads part of the files, since it points to "staging/filetype/category/2017/". It seems to work, and some jobs start when I run these commands; it's just very slow.
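One thing I suspect contributes to the slowness is schema inference, since Spark has to sample the files to work out a schema. A minimal sketch of passing an explicit schema instead (the field names here are just placeholders, not my real fields):

from pyspark.sql.types import StructType, StructField, StringType

# placeholder schema -- only the fields I actually want to extract
schema = StructType([
    StructField("id", StringType(), True),
    StructField("category", StringType(), True),
    StructField("payload", StringType(), True),
])

# with an explicit schema, Spark can skip the inference pass over the files
df = spark.read.schema(schema).json("adl://xxxxxxx.azuredatalakestore.net/staging/filetype/category/2017/*/")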

Job overview

Job 40 indexes all of the subfolders and is relatively fast.

Job 41 checks a set of the files and seems a bit too fast to be true.

Then comes job 42, and that's where the slowness starts. It seems to do the same activities as job 41, just... slow.

I have a feeling I'm running into a problem similar to this thread, but the speed of job 41 makes me doubtful. Are there faster ways to do this?

Upvotes: 3

Views: 2610

Answers (3)

Naiveforever

Reputation: 11

We combine files on an hourly basis using an Azure Function, and that brings down file processing time significantly. So, try combining the files before you send them to the ADB cluster for processing. Otherwise you need a very high number of worker nodes, which will increase your cost.

Upvotes: 1

Michael Rys

Reputation: 6684

To add to Jason's answer:

We have run some test jobs in Azure Data Lake operating on about 1.7m files with U-SQL and were able to complete the processing in about 20 hours with 10 AUs. The job was generating several thousand extract vertices, so with a larger number of AUs, it could have finished in a fraction of the time.

We have not tested 6m files, but if you are willing to try, please let us know.

In any case, I do concur with Jason's suggestion to reduce the number of files and make them larger.

Upvotes: 1

Jason Horner

Reputation: 3690

I think you will need to look at combining the files before processing, both to increase their size and reduce their number. The optimal file size is about 250 MB. There are a number of ways to do this; perhaps the easiest would be to use Azure Data Lake Analytics jobs, or even to use Spark to iterate over a subset of the files, as in the sketch below.
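For example, a rough Spark sketch for compacting one subfolder at a time (the month subfolder and the output path are just placeholders for illustration, and I haven't tested this at your scale):

# read one month's worth of small files...
month_df = spark.read.json("adl://xxxxxxx.azuredatalakestore.net/staging/filetype/category/2017/01/")

# ...and rewrite them as a handful of larger files, aiming for roughly 250 MB each
(month_df
    .coalesce(8)
    .write
    .mode("overwrite")
    .json("adl://xxxxxxx.azuredatalakestore.net/compacted/filetype/category/2017/01/"))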

Upvotes: 0
