martina.physics

Reputation: 9804

PySpark save DataFrame to actual JSON file

How can I save a PySpark DataFrame to a real JSON file?

Following documentation, I have tried

df.write.json('myfile.json')

It works, but it saves the file as a series of dictionaries, one per line (JSON Lines format), and this does not get read properly by

import json
d = json.load(open('myfile.json'))

I would like the file to contain a list of dictionaries. Is there a way?
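A minimal sketch of the failure, assuming output in Spark's one-object-per-line format: the parser consumes the first object and then hits "extra data".

```python
import io
import json

# df.write.json produces JSON Lines: one standalone object per line,
# so the whole file is not a single valid JSON document.
jsonlines = '{"a": 1}\n{"a": 2}\n'

try:
    json.load(io.StringIO(jsonlines))
except json.JSONDecodeError as err:
    print("json.load fails:", err)
```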

Upvotes: 2

Views: 5191

Answers (2)

Xavi

Reputation: 189

You could also do something like

from pyspark.sql.functions import get_json_object

# wrap each line in a tuple so toDF can infer a schema
df = sc.textFile("path/to/file").map(lambda line: (line,)).toDF(["col"])
df = df.select(get_json_object("col", "$").alias("list_of_dictionaries"))
df.list_of_dictionaries

It returns a column object that you could transform into a Python list

Upvotes: 1

zero323

Reputation: 330093

Is there a way to do it? Not really, or at least not an elegant one. You could convert the data to a Python RDD, compute partition statistics, and build the complete document manually, but it looks like a waste of time.

If you want to get a list of dicts, just parse the file(s) line by line:

import json

with open('myfile.json') as fr:
    dicts = [json.loads(line) for line in fr]
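If the goal is a file containing an actual JSON array, a sketch of the full round trip (file names and the sample data here are illustrative; this only makes sense when the data fits in memory):

```python
import json
import os
import tempfile

# Stand-in for one "part" file as written by df.write.json
tmpdir = tempfile.mkdtemp()
part_path = os.path.join(tmpdir, "part-00000")
with open(part_path, "w") as fw:
    fw.write('{"a": 1}\n{"a": 2}\n')

# Parse the JSON Lines output line by line
with open(part_path) as fr:
    dicts = [json.loads(line) for line in fr]

# Write the list of dicts back out as a single JSON array
out_path = os.path.join(tmpdir, "myfile.json")
with open(out_path, "w") as fw:
    json.dump(dicts, fw)

print(dicts)  # [{'a': 1}, {'a': 2}]
```

The resulting `myfile.json` can then be read with a plain `json.load`.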

Upvotes: 2
