mrammah

Reputation: 225

Reading Millions of Small JSON Files from S3 Bucket in PySpark Very Slow

I have a folder (path = mnt/data/*.json) in S3 with millions of JSON files (each file is less than 10 KB). I run the following code:

df = (spark.read
           .option("multiline", True)
           .option("inferSchema", False)
           .json(path))
display(df)

The problem is that it is very slow. Spark creates a job for this with one task. The task appears to have no more executors running it, which usually signifies the completion of a job (right?), but for some reason the command cell in Databricks is still running. It's been stuck like this for 10 minutes. I feel something as simple as this should take no more than 5 minutes.

[screenshot of the Spark job]

Notes to consider:

Upvotes: 6

Views: 9798

Answers (2)

mrammah

Reputation: 225

My approach was very simple, thanks to Anand pointing out the "small file problem." My problem was that I could not efficiently read ~2 million JSON files, each ~10 KB in size, so there was no way I could read them and then store them in Parquet format as an intermediate step. I was given an S3 bucket with raw JSON files scraped from the web.

At any rate, Python's zipfile module came in handy. I used it to bundle multiple JSON files together so that each archive was at least 128 MB and at most 1 GB. Worked pretty well!
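Here is a minimal sketch of that bundling step, assuming the small JSON files have already been copied to a local directory; the paths and the size threshold are placeholders rather than my exact script:

import zipfile
from pathlib import Path

# Placeholder locations and size target -- adjust to your environment.
SRC_DIR = Path("/local_disk0/raw_json")    # the small JSON files
OUT_DIR = Path("/local_disk0/bundled")     # where the combined archives go
MIN_BYTES = 128 * 1024 * 1024              # roll over once a bundle passes ~128 MB
OUT_DIR.mkdir(parents=True, exist_ok=True)

bundle_idx = 0
bundle_size = 0
bundle = zipfile.ZipFile(OUT_DIR / f"bundle_{bundle_idx:05d}.zip", "w")

for json_file in sorted(SRC_DIR.glob("*.json")):
    bundle.write(json_file, arcname=json_file.name)
    bundle_size += json_file.stat().st_size
    # Keep each archive between roughly 128 MB and 1 GB of raw JSON.
    if bundle_size >= MIN_BYTES:
        bundle.close()
        bundle_idx += 1
        bundle_size = 0
        bundle = zipfile.ZipFile(OUT_DIR / f"bundle_{bundle_idx:05d}.zip", "w")

bundle.close()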

There is also another way you can do this using AWS Glue, though that requires IAM role authorization and can be expensive. The advantage is that you can convert those files into Parquet directly.
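Roughly, a Glue ETL job for that conversion would look like the sketch below; the bucket names are placeholders and the awsglue imports only exist inside a Glue job environment, so treat it as an outline rather than a drop-in script:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw JSON files from S3 (placeholder bucket/prefix).
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-raw-bucket/json/"]},
    format="json",
)

# Write them back out as Parquet at another placeholder location.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-bucket/parquet/"},
    format="parquet",
)

job.commit()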

zipfile solution: https://docs.python.org/3/library/zipfile.html

AWS Glue solution: https://medium.com/searce/convert-csv-json-files-to-apache-parquet-using-aws-glue-a760d177b45f

Really good blog posts explaining the small file problem:

https://mungingdata.com/apache-spark/compacting-files/

https://garrens.com/blog/2017/11/04/big-data-spark-and-its-small-files-problem/?unapproved=252&moderation-hash=5a657350c6169448d65209caa52d5d2c#comment-252

Upvotes: 4

Anand Vidvat

Reputation: 1059

Apache Spark is very good at handling large files, but when you have tens of thousands of small files (millions in your case) in a directory, or distributed across several directories, it has a severe impact on processing time (potentially tens of minutes to hours), since Spark has to read each of these tiny files.

An ideal file size is between 128 MB and 1 GB on disk; anything smaller than 128 MB (due to spark.sql.files.maxPartitionBytes, which defaults to 128 MB) causes this tiny-files problem and becomes the bottleneck.
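As a quick check, you can inspect that setting on your cluster (assuming the usual spark session object in a Databricks notebook):

# Current cap on how many bytes Spark packs into a single read partition;
# it defaults to 128 MB.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))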

You can rewrite the data in Parquet format at an intermediate location, either as one large file using coalesce or as multiple evenly sized files using repartition (see the sketch after the next point).

You can then read the data from this intermediate location for further processing, and this should prevent the bottlenecks that come with the tiny-files problem.
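A minimal sketch of that compaction step, assuming a Databricks notebook where spark is already defined; the paths, the example schema, and the partition count are placeholders you would tune for your data:

from pyspark.sql.types import StringType, StructField, StructType

# Supplying an explicit schema avoids an extra pass over millions of
# files just to infer it; these two fields are only an example.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("payload", StringType(), True),
])

raw = (spark.read
            .schema(schema)
            .json("s3://my-bucket/raw_json/"))         # placeholder source path

# Rewrite as a modest number of evenly sized Parquet files; pick the
# partition count so each output file lands in the 128 MB - 1 GB range.
(raw.repartition(200)
    .write
    .mode("overwrite")
    .parquet("s3://my-bucket/intermediate_parquet/"))  # placeholder target

# Downstream jobs read the compacted copy instead of the tiny JSON files.
df = spark.read.parquet("s3://my-bucket/intermediate_parquet/")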

Upvotes: 5
