Reputation: 3845
I am building a Python script in which I need to generate a json file from json RDD . Following is code snippet for saving json file.
jsonRDD.map(lambda x :json.loads(x))
.coalesce(1, shuffle=True).saveAsTextFile('examples/src/main/resources/demo.json')
But I need to write json data to a single file instead of data distributed across several partitions.
So please suggest me appropriate solution for it
Upvotes: 2
Views: 9837
Reputation: 381
Without the use of additional libraries like pandas, you could save your RDD of several jsons by reducing them to one big string of jsons, each separated by a new line:
# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)
# map jsons back to string
jsonRDD = jsonRDD.map(json.dumps)
# reduce to one big string with one json on each line
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)
# write your string to a file
with open("path/to/your.json", "w") as f:
f.write(json_string.encode("utf-8"))
Upvotes: 1
Reputation: 2954
I have had issues with pyspark saving off JSON files once I have them in a RDD or dataframe, so what I do is convert them to a pandas dataframe and save them to a non distributed directory.
import pandas
df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
df2.to_json(yourpath)
Upvotes: 0