This solution, in theory, works perfectly for what I need, which is to create a copy of a DataFrame that excludes certain nested struct fields. Here is a minimal reproducible example of my issue:
>>> df.printSchema()
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
 |    |    |-- delete: string (nullable = true)
which you can instantiate like so:
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType()),
    StructField("delete", StringType())
])))])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
My goal is to convert the DataFrame (preserving the values in the columns I want to keep) into one whose schema excludes certain nested fields, delete for example:
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
According to the solution I linked, which leverages pyspark.sql's to_json and from_json functions, this should be achievable with something like the following:
from pyspark.sql.functions import col, from_json, to_json

new_schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType())
])))])
test_df = df.withColumn("big", to_json(col("big"))).withColumn("big", from_json(col("big"), new_schema))
>>> test_df.printSchema()
root
 |-- big: struct (nullable = true)
 |    |-- big: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- keep: string (nullable = true)
>>> test_df.show()
+----+
| big|
+----+
|null|
+----+
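To see what from_json is actually being fed, the intermediate string can be inspected on a one-row copy of the DataFrame (a small diagnostic sketch of my own; the sample values "a" and "b" are illustrative):
# Build one row so to_json has something to serialize;
# the inner tuple maps onto the struct fields of the schema above.
one_row = spark.createDataFrame([([("a", "b")],)], schema)
one_row.select(to_json("big")).show(truncate=False)
# +---------------------------+
# |to_json(big)               |
# +---------------------------+
# |[{"keep":"a","delete":"b"}]|
# +---------------------------+
The string is a top-level JSON array, whereas new_schema above describes an object with a big field, which is presumably where things go wrong.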
So either I'm not following the linked directions correctly, or the approach doesn't work as advertised. How do you do this without a UDF?
PySpark to_json documentation
PySpark from_json documentation
Upvotes: 1
Views: 4567
It should work; you just need to define new_schema as the type of the column big itself, not as a schema for the whole DataFrame:
new_schema = ArrayType(StructType([StructField("keep", StringType())]))
test_df = df.withColumn("big", from_json(to_json("big"), new_schema))
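As a quick sanity check (a sketch of my own, not from the original answer; it assumes a one-row DataFrame built with the question's schema, since the empty one shows nothing, and the sample values are illustrative):
from pyspark.sql.functions import from_json, to_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Same schema as in the question, but with one row instead of an empty RDD.
schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType()),
    StructField("delete", StringType())
])))])
df = spark.createDataFrame([([("a", "b")],)], schema)

# The schema handed to from_json describes the column itself, not the DataFrame.
new_schema = ArrayType(StructType([StructField("keep", StringType())]))
test_df = df.withColumn("big", from_json(to_json("big"), new_schema))
test_df.printSchema()  # big: array<struct<keep:string>> -- delete is gone
test_df.show()         # [{a}]
This works because to_json("big") produces a top-level JSON array string, and from_json can parse it back only when given the array's own type. Wrapping that type in a root-level StructType makes from_json expect a JSON object like {"big": [...]}, which is why the question's version parsed to null.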
Upvotes: 1