Reputation: 15103
How can I convert the code below, which builds the JSON with rdd.map, to instead write the output JSON with a PySpark DataFrame using df2.write.format('json')? Can you show how in the example below, as this might resolve the apostrophe issue altogether? Here is what I tried:
import json
rdd = sc.parallelize([(1,2,3),(4,5,6),(7,8,9)])
df = rdd.toDF(["a","b","c"])
rddToJson = df.rdd.map(lambda x: json.dumps({"some_top_level_1": {"mycolumn1": x.a}}))  # note that the result JSON is complex and more nested than the input
rddToJson.collect()
The result contains apostrophes (I cannot simply replace them, since an apostrophe can appear anywhere in the values). How do I do it with a proper schema and a DataFrame, and then df.write.json?
result:
Out[20]:
['{"some_top_level_1": {"mycolumn1": 1}}',
'{"some_top_level_1": {"mycolumn1": 4}}',
'{"some_top_level_1": {"mycolumn1": 7}}']
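As an aside, those apostrophes are not in the JSON at all: collect() returns Python str objects, and the single quotes come from Python's repr when the list is printed. A quick stdlib check confirms the strings themselves contain only double quotes:

```python
import json

# json.dumps always emits double-quoted JSON; the apostrophes seen in
# the collect() output are just Python's repr of each str element.
s = json.dumps({"some_top_level_1": {"mycolumn1": 1}})
print(s)         # {"some_top_level_1": {"mycolumn1": 1}}
print("'" in s)  # False: no apostrophe anywhere in the string
```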
My target (unless it can be done in another way) is to use df.write.format('json') in order to write the nested/complex JSON from the above input.
PS: I saw this interesting post: https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803 but as I'm a newbie I was not sure how I could transform the input that I have into the nested schema that I need on output.
Upvotes: 1
Views: 5303
Reputation: 1737
You can use the struct function to create a nested DataFrame from the flat schema.
from pyspark.sql.functions import col, struct

rdd = sc.parallelize([(1,2,3),(4,5,6),(7,8,9)])
df = rdd.toDF(["a","b","c"])
df2 = df.withColumn("some_top_level_1", struct(col("a").alias("my_column1"))).select("some_top_level_1")
df2.coalesce(1).write.mode("overwrite").json("test.json")  # writes a directory named test.json containing the part files
Upvotes: 2