martina.physics

Reputation: 9804

PySpark save DataFrame to actual JSON file

How can I save a PySpark DataFrame to a real JSON file?

Following documentation, I have tried

df.write.json('myfile.json')

It works, but it saves the file as a series of dictionaries, one per line (JSON Lines format), and this does not get read properly by

import json
d = json.load(open('myfile.json'))

I would like the file to contain a list of dictionaries. Is there a way?
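A minimal sketch of the failure, assuming output in Spark's one-object-per-line format: the parser consumes the first object and then hits "extra data".

```python
import io
import json

# df.write.json produces JSON Lines: one standalone object per line,
# so the whole file is not a single valid JSON document.
jsonlines = '{"a": 1}\n{"a": 2}\n'

try:
    json.load(io.StringIO(jsonlines))
except json.JSONDecodeError as err:
    print("json.load fails:", err)
```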

Upvotes: 2

Views: 5191

Answers (2)

Xavi

Reputation: 189

You could also do something like

from pyspark.sql.functions import get_json_object

# wrap each line in a tuple so toDF can infer a schema
df = sc.textFile("path/to/file").map(lambda line: (line,)).toDF(["col"])
df = df.select(get_json_object("col", "$").alias("list_of_dictionaries"))
df.list_of_dictionaries

It returns a column object that you could transform into a Python list

Upvotes: 1

zero323

Reputation: 330093

Is there a way to do it? Not really, or at least not an elegant one. You could convert the data to a Python RDD, compute partition statistics, and build the complete document manually, but it looks like a waste of time.

If you want to get a list of dicts, just parse the file(s) line by line:

import json

with open('myfile.json') as fr:
    dicts = [json.loads(line) for line in fr]
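If the goal is a file containing an actual JSON array, a sketch of the full round trip (file names and the sample data here are illustrative; this only makes sense when the data fits in memory):

```python
import json
import os
import tempfile

# Stand-in for one "part" file as written by df.write.json
tmpdir = tempfile.mkdtemp()
part_path = os.path.join(tmpdir, "part-00000")
with open(part_path, "w") as fw:
    fw.write('{"a": 1}\n{"a": 2}\n')

# Parse the JSON Lines output line by line
with open(part_path) as fr:
    dicts = [json.loads(line) for line in fr]

# Write the list of dicts back out as a single JSON array
out_path = os.path.join(tmpdir, "myfile.json")
with open(out_path, "w") as fw:
    json.dump(dicts, fw)

print(dicts)  # [{'a': 1}, {'a': 2}]
```

The resulting `myfile.json` can then be read with a plain `json.load`.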

Upvotes: 2
