Jas

Reputation: 15103

In PySpark, how do I convert an RDD to JSON with a different schema?

How can I convert the code below so that it writes the output JSON from a PySpark DataFrame, using df2.write.format('json')?

  1. I have an input list (only a few items, for the sake of example).
  2. I want to write JSON that is more complex/nested than the input.
  3. I tried using rdd.map.
  4. Problem: the output contains apostrophes around each JSON object.
  5. I cannot just do a string replace, because the data itself might contain apostrophes.
  6. If there is a better way to convert the schema to the nested JSON with a DataFrame, can you show how, using the example below? That might resolve the apostrophe problem altogether.

Here is what I tried:

import json 

rdd = sc.parallelize([(1,2,3),(4,5,6),(7,8,9)])
df = rdd.toDF(["a","b","c"])
# note that the resulting JSON is more complex and more nested than the input
rddToJson = df.rdd.map(lambda x: json.dumps({"some_top_level_1": {"mycolumn1": x.a}}))
rddToJson.collect()

The result contains apostrophes around each JSON object (I cannot replace them, because an apostrophe can appear anywhere in the values). How can I do this with a proper schema and a DataFrame, and then df.write.json?

result:

Out[20]: 
['{"some_top_level_1": {"mycolumn1": 1}}',
 '{"some_top_level_1": {"mycolumn1": 4}}',
 '{"some_top_level_1": {"mycolumn1": 7}}']

My target (unless it can be done another way) is to use df.write.format('json') to write the nested/complex JSON from the above input.

PS: I saw this interesting post: https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803 but as I'm a newbie, I was not sure how to transform the input that I have into the nested schema that I need on output.

Upvotes: 1

Views: 5303

Answers (1)

Manoj Singh

Reputation: 1737

You can use the struct function to create a nested dataframe from the flat schema.

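from pyspark.sql.functions import col, struct

rdd = sc.parallelize([(1,2,3),(4,5,6),(7,8,9)])
df = rdd.toDF(["a","b","c"])

# Wrap column "a" in a struct so it is written as a nested JSON object
df2 = df.withColumn("some_top_level_1", struct(col("a").alias("mycolumn1"))).select("some_top_level_1")
# Spark writes a directory named "test.json" that contains the part file(s)
df2.coalesce(1).write.mode("overwrite").json("test.json")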
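
On the apostrophes in the question: they are most likely not part of the data. collect() returns a Python list of str, and the notebook displays each str with quotes around it. Here is a minimal sketch of that, in plain Python without Spark, assuming the same rows and the same json.dumps call as in the question (the names rows and json_lines are only for illustration):

```python
import json

# Same rows as in the question, as plain tuples (no Spark needed for this check).
rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# Build one JSON string per row, as the rdd.map lambda in the question does.
json_lines = [
    json.dumps({"some_top_level_1": {"mycolumn1": a}})
    for a, _b, _c in rows
]

# Displaying the list shows the repr of each str, which includes quotes.
print(json_lines)

# Printing each element shows the actual characters of the string.
for line in json_lines:
    print(line)
```

So if you keep the RDD approach, writing the strings out as text (for example with rddToJson.saveAsTextFile) should give one JSON object per line without the quotes. The struct approach above is still the more direct way to get df.write.json to produce the nested output.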

Upvotes: 2
