Sivaprasanna Sethuraman

Reputation: 4132

Transform PySpark DataFrame into a nested structure

I'm learning PySpark. I load a CSV file into a Spark DataFrame, apply some transformations, and finally want to put a set of columns into a nested structure and save the result as JSON.

This is the sample code that I have:

from pyspark.sql.functions import col, struct

df = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 32), ("Thorin", 19),
                            ("Balin", 18), ("Kili", 37), ("Dwalin", 19), ("Oin", 46),
                            ("Gloin", 28), ("Fili", 22)], ["name", "age"])

mod_df = df.select(struct([col(x) for x in df.columns]).alias("PersonalDetails"))

When I save this as a JSON file, it looks like:

{
        "PersonalDetails" :
                {
                        "name" : "Balin",
                        "age" : 18
                }
}
{
        "PersonalDetails" :
                {
                        "name" : "Gloin",
                        "age" : 28
                }
}

As you can see, the records come out as separate documents. I want them in a single document, with the records collected into an array, like:

{
        "PersonalDetails" :[
                {
                        "name" : "Balin",
                        "age" : 18
                },
                {
                        "name" : "Gloin",
                        "age" : 28
                }
        ]
}

Can you help me see where I'm going wrong? Thank you :)

Upvotes: 0

Views: 1848

Answers (1)

Zhang Tong

Reputation: 4719

from pyspark.sql import functions as F

mod_df = df.select(
    # Nest all columns of each row into a single struct column
    F.struct(df.columns).alias('PersonalDetails')
).select(
    # Aggregate every row's struct into one array, yielding a single output record
    F.collect_list('PersonalDetails').alias('PersonalDetails')
)
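For intuition, the two steps above can be sketched in plain Python without Spark; the sample rows here are taken from the question's data, and the dict shapes mirror what `struct` and `collect_list` produce:

```python
import json

# Two of the rows from the question's DataFrame.
rows = [{"name": "Balin", "age": 18}, {"name": "Gloin", "age": 28}]

# Step 1 (struct): wrap each row's columns into a nested record.
structs = [{"PersonalDetails": row} for row in rows]

# Step 2 (collect_list): gather every record into one array,
# so the whole dataset serializes as a single JSON document.
collected = {"PersonalDetails": [s["PersonalDetails"] for s in structs]}

print(json.dumps(collected))
# → {"PersonalDetails": [{"name": "Balin", "age": 18}, {"name": "Gloin", "age": 28}]}
```

One caveat: `collect_list` without a `groupBy` aggregates the entire DataFrame into a single row, so this approach only suits datasets small enough for one record to hold comfortably.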

Upvotes: 2
