Reputation: 43
Using Spark v2.1 and Python, I load JSON files with
sqlContext.read.json("path/data.json")
I have a problem with the JSON output. Using the command below
df.write.json("path/test.json")
the data is saved in a folder called test.json (not a file), which contains two files: one empty and the other with a strange name:
part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f
Is there any way to get a clean, single JSON output file?
Thanks
Upvotes: 0
Views: 474
Reputation: 23119
Yes, Spark writes the output to multiple files when you save. Since the computation is distributed, the output is written as multiple part files like part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f. The number of files created equals the number of partitions.
If your data is small and fits in memory, then you can save your output as a single file. But if your data is large, saving it to a single file is not recommended.
Actually, test.json
is a directory, not a JSON
file. It contains multiple part files inside it. This does not create any problem for you; you can easily read it back later with sqlContext.read.json("path/test.json").
If you still want your output in a single file, then you need to repartition to 1, which brings all your data to a single node before saving. This may cause issues if you have large data.
df.repartition(1).write.json("path/test.json")
Or
df.coalesce(1).write.json("path/test.json")
(Note: df.collect() cannot be used here, because it returns a plain Python list of Row objects, which has no write attribute. coalesce(1) also produces a single part file, and unlike repartition(1) it avoids a full shuffle.)
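Even with a single partition, Spark still writes a directory containing one part-* file plus a _SUCCESS marker, not a plain file. If you really need a clean standalone .json file, you can pull the part file out with a small plain-Python step. This is a sketch assuming the standard Spark output layout; the function name and paths are illustrative:

```python
import glob
import os
import shutil

def extract_single_json(spark_output_dir, dest_path):
    """Move the single part-* file out of a Spark output directory
    and save it under a clean filename, then remove the directory."""
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*"))
    if len(part_files) != 1:
        raise ValueError(
            "expected exactly one part file, found %d" % len(part_files)
        )
    # Move the lone part file to the destination name...
    shutil.move(part_files[0], dest_path)
    # ...and drop the now-redundant directory (including _SUCCESS).
    shutil.rmtree(spark_output_dir)
```

After df.repartition(1).write.json("path/test.json"), calling extract_single_json("path/test.json", "path/test_clean.json") leaves you with a single ordinary JSON-lines file.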
Upvotes: 0