How to use to_json and from_json to eliminate nested structfields in pyspark dataframe?

Question

This solution in theory, works perfectly for what I need, which is to create a new copied version of a dataframe while excluding certain nested structfields. here is a minimally reproducible artifact of my issue:

>>> df.printSchema()
root
| -- big: array(nullable=true)
| | -- element: struct(containsNull=true)
| | | -- keep: string(nullable=true)
| | | -- delete: string(nullable=true)

which you can instantiate like such:

schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType()),
    StructField("delete", StringType())
])))])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

My goal is to convert the dataframe (along with the values in the columns I want to keep) to one that excludes certain nested structs, like delete for example.

root
| -- big: array(nullable=true)
| | -- element: struct(containsNull=true)
| | | -- keep: string(nullable=true)

According to the solution I linked that tries to leverage pyspark.sql's to_json and from_json functions, it should be accomplishable with something like this:

new_schema = StructType([StructField("big", ArrayType(StructType([
             StructField("keep", StringType())
])))])

test_df = df.withColumn("big", to_json(col("big"))).withColumn("big", from_json(col("big"), new_schema))

>>> test_df.printSchema()
root
| -- big: struct(nullable=true)
| | -- big: array(nullable=true)
| | | -- element: struct(containsNull=true)
| | | | -- keep: string(nullable=true)

>>> test_df.show()
+----+
| big|
+----+
|null|
+----+

So either I'm not following his directions right, or it doesn't work. How do you do this without a udf?

Pyspark to_json documentation Pyspark from_json documentation

jxc · Accepted Answer

It should be working, you just need to adjust your new_schema to include metadata for the column 'big' only, not for the dataframe:

new_schema = ArrayType(StructType([StructField("keep", StringType())]))

test_df = df.withColumn("big", from_json(to_json("big"), new_schema))

How to use to_json and from_json to eliminate nested structfields in pyspark dataframe?

Answers (1)

Related Questions