Reputation: 738
I have a pyspark dataframe with column "Student".
One entry of data is as follows:
{
"Student" : {
"m" : {
"name" : {"s" : "john"},
"score": {"s" : "165"}
}
}
}
I want to change the schema of this column, so that the entry looks as follows:
{
"Student" :
{
"m" :
{
"StudentDetails" :
{
"m" :
{
"name" : {"s" : "john"},
"score": {"s" : "165"}
}
}
}
}
}
The problem is that the Student field can also be null in the dataframe. So I want to retain the null values but change the schema of not null values. I have used a udf for the above process which works.
def Helper_ChangeSchema(row):
#null check
if row is None:
return None
#change schema
data = row.asDict(True)
return {"m":{"StudentDetails":data}}
but udf is a black box for spark. Is there any method to do the same using inbuilt spark functions or sql queries.
Upvotes: 0
Views: 5338
Reputation: 14905
It works exactly like in this answer. Just add another nested level in the struct:
Either as SQL expression:
processedDf = df.withColumn("student", F.expr("named_struct('m', named_struct('student_details', student))"))
or in Python code using the struct function:
processedDf = df.withColumn("student", F.struct(F.struct(F.col("student")).alias('m')))
Both versions have the same result:
root
|-- student: struct (nullable = false)
| |-- m: struct (nullable = false)
| | |-- student_details: struct (nullable = true)
| | | |-- m: struct (nullable = true)
| | | | |-- name: struct (nullable = true)
| | | | | |-- s: string (nullable = true)
| | | | |-- score: struct (nullable = true)
| | | | | |-- s: string (nullable = true)
Both approaches work also fine with empty rows. Using this input data
data ='{"student" : {"m" : {"name" : {"s" : "john"},"score": {"s" : "165"}}}}'
data2='{"student": null }'
df = spark.read.json(sc.parallelize([data, data2]))
processedDf.show(truncate=False)
prints
+---------------------+
|student |
+---------------------+
|[[[[[john], [165]]]]]|
|[[]] |
+---------------------+
processedDf = df.withColumn("student", F.when(F.col("student").isNull(), F.lit(None)).otherwise(F.struct(F.struct(F.col("student")).alias('m'))))
This will result in the same schema, but a different output for the null row:
+---------------------+
|student |
+---------------------+
|[[[[[john], [165]]]]]|
|null |
+---------------------+
Upvotes: 1