martina.physics

Reputation: 9804

PySpark: spit out single file when writing instead of multiple part files

Is there a way to prevent PySpark from creating several small files when writing a DataFrame to JSON file?

If I run:

 df.write.format('json').save('myfile.json')

or

df1.write.json('myfile.json')

it creates a folder named myfile.json, and inside it several small files named part-*, the HDFS way. Is there any way to have it write a single file instead?

Upvotes: 9

Views: 19423

Answers (3)

Zahiduzzaman

Reputation: 197

This was a better solution for me.

rdd.map(json.dumps).saveAsTextFile(json_lines_file_name)
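For context, the key step here is serializing each record to one JSON string per line before saving as text. A minimal sketch of that mapper (assuming records are plain dicts, as Row.asDict() would yield; row_to_json_line is a hypothetical helper name):

```python
import json

def row_to_json_line(record):
    """Serialize one record (a dict) to a single-line JSON string,
    suitable for rdd.map(...) before saveAsTextFile."""
    return json.dumps(record, sort_keys=True)

# In Spark this would be applied per element, e.g.:
#   df.rdd.map(lambda r: row_to_json_line(r.asDict())).saveAsTextFile(path)
print(row_to_json_line({"name": "Alice", "id": 1}))
```

Note this still writes one part file per partition unless you also coalesce.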

Upvotes: 0

the.malkolm

Reputation: 2421

Well, the answer to your exact question is the coalesce function. But, as already mentioned, it is not efficient at all, since it forces a single worker to fetch all the data and write it sequentially.

df.coalesce(1).write.format('json').save('myfile.json')

P.S. Note that the resulting file is not a valid JSON document; it contains one JSON object per line (the JSON Lines format).
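If you actually need one valid JSON array, one option (my own sketch, not part of the answer; merge_json_lines is a hypothetical helper) is to post-process the part files on the driver with plain Python:

```python
import glob
import json
import os

def merge_json_lines(parts_dir, out_path):
    """Merge Spark part-* files of JSON lines into a single valid JSON array file.

    Returns the number of records written. Only practical when the data
    fits in driver memory.
    """
    records = []
    for part in sorted(glob.glob(os.path.join(parts_dir, "part-*"))):
        with open(part) as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    with open(out_path, "w") as f:
        json.dump(records, f)
    return len(records)
```

This carries the same caveat as coalesce(1): everything funnels through one machine.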

Upvotes: 12

J Maurer

Reputation: 1044

df1.rdd.repartition(1).write.json('myfile.json')

Would be nice, but isn't available: RDDs don't expose the DataFrame writer. Check this related question. https://stackoverflow.com/a/33311467/2843520

Upvotes: -2
