Reputation: 3845

Generate single json file for pyspark RDD

I am building a Python script that needs to generate a JSON file from an RDD of JSON strings. The following snippet saves the data:

jsonRDD.map(lambda x: json.loads(x)) \
    .coalesce(1, shuffle=True).saveAsTextFile('examples/src/main/resources/demo.json')

But I need to write the JSON data to a single file instead of having it distributed across several partitions.

Please suggest an appropriate solution.

Upvotes: 2

Answers (2)

vkoe

Reputation: 381

Without additional libraries such as pandas, you can save an RDD of JSON documents by reducing it to one big string, with the documents separated by newlines:

import json

# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)

# map the parsed objects back to JSON strings
jsonRDD = jsonRDD.map(json.dumps)

# reduce to one big string with one JSON document per line
# (the result is collected on the driver, so it must fit in memory)
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)

# write the string to a file; in Python 3, open in text mode with an
# explicit encoding instead of writing encoded bytes
with open("path/to/your.json", "w", encoding="utf-8") as f:
    f.write(json_string)
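If the combined string is too large to build in one piece, a variation is to stream the records to the driver and write them line by line. The following is only a sketch: write_json_lines is a hypothetical helper name, it accepts any iterable of JSON-serializable records, and the commented usage assumes the same jsonRDD and output path as above together with RDD.toLocalIterator().

```python
import json


def write_json_lines(records, path):
    """Write each record as one JSON document per line into a single file.

    `records` can be any iterable of JSON-serializable objects.
    Returns the number of records written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record))
            f.write("\n")
            count += 1
    return count


# Hypothetical usage with the RDD from above (needs a running Spark context):
# write_json_lines(jsonRDD.map(json.loads).toLocalIterator(), "path/to/your.json")
```

Unlike the reduce approach, this writes a trailing newline after the last record and never holds the whole output in memory as one string.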

Upvotes: 1

Jared

Reputation: 2954

I have had issues with PySpark saving JSON files once the data is in an RDD or DataFrame, so I convert it to a pandas DataFrame and save it to a non-distributed directory.

import pandas

df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
df2.to_json(yourpath)
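By default, to_json on a DataFrame writes one JSON object keyed by column. If you want one JSON object per row instead, matching the usual one-record-per-line layout, pandas supports that with orient="records" and lines=True. A minimal sketch, assuming the data fits in driver memory; save_as_json_lines is a hypothetical helper name:

```python
import pandas as pd


def save_as_json_lines(pdf, path):
    """Write a pandas DataFrame to a single file, one JSON object per row."""
    # orient="records" combined with lines=True yields newline-delimited JSON
    pdf.to_json(path, orient="records", lines=True)


# Hypothetical usage with the DataFrame from above:
# save_as_json_lines(df1.toPandas(), yourpath)
```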

Upvotes: 0
