Reputation: 9804
How can I save a PySpark DataFrame to a real JSON file?
Following documentation, I have tried
df.write.json('myfile.json')
It works, but it saves the file as a series of dictionaries, one per line, and this does not get read properly by
import json
d = json.load(open('myfile.json'))
I would like the file to contain a list of dictionaries. Is there a way?
Upvotes: 2
Views: 5191
Reputation: 189
You could also do something like

from pyspark.sql.functions import get_json_object

# textFile yields an RDD of strings; wrap each line in a tuple so toDF can build a schema
df = sc.textFile("path/to/file").map(lambda line: (line,)).toDF(["col"])
parsed = df.select(get_json_object("col", "$").alias("list_of_dictionaries"))
parsed.list_of_dictionaries
This returns a Column object that you can transform into a Python list.
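For example, a minimal sketch of materializing that column as a Python list of dicts (assuming parsed is the DataFrame built above; everything is collected to the driver):

import json

# get_json_object returns None for rows it cannot parse, so filter those out
dicts = [json.loads(row.list_of_dictionaries)
         for row in parsed.collect()
         if row.list_of_dictionaries is not None]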
Upvotes: 1
Reputation: 330093
Is there a way to do it? Not really, or at least not in an elegant way. You could convert the data to a Python RDD, compute partition statistics, and build the complete document manually, but it looks like a waste of time.
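If you do want a single file holding a JSON array, a minimal sketch (this collects everything to the driver, so it is only viable when the data fits in memory; the output filename is just illustrative):

import json

# toJSON() yields one JSON string per row; collect() pulls them all to the driver
records = [json.loads(s) for s in df.toJSON().collect()]
with open('myfile_as_list.json', 'w') as fw:
    json.dump(records, fw)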
If you want to get a list of dicts, just parse the file(s) line by line:
import json

with open('myfile.json') as fr:
    dicts = [json.loads(line) for line in fr]
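Keep in mind that df.write.json('myfile.json') actually creates a directory of part files rather than a single file, so in practice you may need to glob over them (a sketch; part-* is Spark's default output naming):

import glob
import json

dicts = []
# Spark writes the output as a directory containing part-* files
for path in glob.glob('myfile.json/part-*'):
    with open(path) as fr:
        dicts.extend(json.loads(line) for line in fr)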
Upvotes: 2