Sivaprasanna Sethuraman

Reputation: 4132

Transform PySpark DataFrame into a nested structure

I'm learning PySpark. I load a CSV file into a Spark DataFrame, apply some transformations, and finally want to put a set of columns into a nested structure and save the result as JSON.

This is the sample code that I have:

from pyspark.sql.functions import col, struct

df = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 32), ("Thorin", 19),
                            ("Balin", 18), ("Kili", 37), ("Dwalin", 19), ("Oin", 46),
                            ("Gloin", 28), ("Fili", 22)], ["name", "age"])

mod_df = df.select(struct([col(x) for x in df.columns]).alias("PersonalDetails"))

When I save this as a JSON file, it looks like:

{
        "PersonalDetails" :
                {
                        "name" : "Balin",
                        "age" : 18
                }
}
{
        "PersonalDetails" :
                {
                        "name" : "Gloin",
                        "age" : 28
                }
}

As you can see, the records come out as separate documents. I want them in a single document, with the records collected into an array, like:

{
        "PersonalDetails" :[
                {
                        "name" : "Balin",
                        "age" : 18
                },
                {
                        "name" : "Gloin",
                        "age" : 28
                }
        ]
}

Can you help me see where I'm going wrong? Thank you :)

Upvotes: 0

Views: 1848

Answers (1)

Zhang Tong

Reputation: 4719

from pyspark.sql import functions as F

mod_df = df.select(
    # Nest all columns of each row into a single struct column
    F.struct(df.columns).alias('PersonalDetails')
).select(
    # Aggregate every row's struct into one array, yielding a single output record
    F.collect_list('PersonalDetails').alias('PersonalDetails')
)
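For intuition, the two steps above can be sketched in plain Python without Spark; the sample rows here are taken from the question's data, and the dict shapes mirror what `struct` and `collect_list` produce:

```python
import json

# Two of the rows from the question's DataFrame.
rows = [{"name": "Balin", "age": 18}, {"name": "Gloin", "age": 28}]

# Step 1 (struct): wrap each row's columns into a nested record.
structs = [{"PersonalDetails": row} for row in rows]

# Step 2 (collect_list): gather every record into one array,
# so the whole dataset serializes as a single JSON document.
collected = {"PersonalDetails": [s["PersonalDetails"] for s in structs]}

print(json.dumps(collected))
# → {"PersonalDetails": [{"name": "Balin", "age": 18}, {"name": "Gloin", "age": 28}]}
```

One caveat: `collect_list` without a `groupBy` aggregates the entire DataFrame into a single row, so this approach only suits datasets small enough for one record to hold comfortably.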

Upvotes: 2
