Reputation: 4132
I'm learning PySpark. I load a CSV file into a Spark DataFrame and then do some data transformations. Finally, I want a set of columns to be put into a nested structure and then saved in JSON format.
This is the sample code that I have:
from pyspark.sql.functions import col, struct

df = spark.createDataFrame(
    [("Bilbo Baggins", 50), ("Gandalf", 32), ("Thorin", 19),
     ("Balin", 18), ("Kili", 37), ("Dwalin", 19), ("Oin", 46),
     ("Gloin", 28), ("Fili", 22)],
    ["name", "age"],
)
mod_df = df.select(struct([col(x) for x in df.columns]).alias("PersonalDetails"))
When I save this as a JSON file, it looks like:
{
    "PersonalDetails": {
        "name": "Balin",
        "age": 18
    }
}
{
    "PersonalDetails": {
        "name": "Gloin",
        "age": 28
    }
}
As you can see, each row comes out as a separate document. However, I want them all in a single document, with the rows collected into an array, like:
{
    "PersonalDetails": [
        {
            "name": "Balin",
            "age": 18
        },
        {
            "name": "Gloin",
            "age": 28
        }
    ]
}
Can you help me figure out where I'm going wrong? Thank you :)
Upvotes: 0
Views: 1848
Reputation: 4719
from pyspark.sql import functions as F

mod_df = df.select(
    # nest all columns of each row into a single struct column
    F.struct(df.columns).alias('PersonalDetails')
).select(
    # aggregate all structs into one array, collapsing the frame to one row
    F.collect_list('PersonalDetails').alias('PersonalDetails')
)
Upvotes: 2