Reputation: 2913
I have a PySpark DataFrame with an array of structs, containing two columns (colorcode and name). I want to add a new column to the struct, newcol.
This question answered "how to add a column to a nested struct", but I'm failing to transfer it to my case, where the struct is further nested inside an array. I can't seem to reference/recreate the array-struct schema.
My schema:
|-- Id: string (nullable = true)
|-- values: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Dep: long (nullable = true)
| | |-- ABC: string (nullable = true)
What it should become:
|-- Id: string (nullable = true)
|-- values: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Dep: long (nullable = true)
| | |-- ABC: string (nullable = true)
| | |-- newcol: string (nullable = true)
How do I transfer the solution to my nested struct?
Reproducible code to get a df of the above schema:
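from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType, IntegerType, LongType, StringType, StructField, StructType
)

spark = SparkSession.builder.getOrCreate()

# Each row: an id plus an array of structs (Dep is a long, ABC a string)
data = [
    (10, [{"Dep": 10, "ABC": "1"}, {"Dep": 10, "ABC": "1"}]),
    (20, [{"Dep": 20, "ABC": "1"}, {"Dep": 20, "ABC": "1"}]),
    (30, [{"Dep": 30, "ABC": "1"}, {"Dep": 30, "ABC": "1"}]),
    (40, [{"Dep": 40, "ABC": "1"}, {"Dep": 40, "ABC": "1"}])
]
myschema = StructType(
    [
        StructField("id", IntegerType(), True),
        StructField("values",
            ArrayType(
                StructType([
                    StructField("Dep", LongType(), True),
                    StructField("ABC", StringType(), True)
                ])
        ))
    ]
)
df = spark.createDataFrame(data=data, schema=myschema)
df.printSchema()
df.show(10, False)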
Upvotes: 3
Views: 3662
Reputation: 26676
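Another way of doing it is to use a SQL expression. Note that newcol is listed first here, so it becomes the first field of the struct; list it last if the field order matters.
from pyspark.sql import functions as F
df = df.withColumn("values",F.expr("transform(values, x -> struct(COALESCE('1') as newcol,x.Dep,x.ABC))"))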
Upvotes: 1
Reputation: 4189
For Spark version >= 3.1, you can use the transform function and the withField method to achieve this.
transform applies the provided function to each element (struct(Dep, ABC) here) of the array (the values column here). withField adds or replaces a field in a StructType by name.
from pyspark.sql import functions as F
df = df.withColumn('values', F.transform('values', lambda x: x.withField('newcol', F.lit(1))))
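To make the per-element behaviour described above concrete, here is a rough plain-Python analogy that does not use Spark at all. The helpers transform and with_field are hypothetical stand-ins written for this illustration, not the PySpark API, and each struct is modelled as a dict; treat it as a sketch of the semantics only.

```python
# Plain-Python analogy (no Spark): hypothetical helpers that mimic what
# F.transform and Column.withField do to a single row's "values" array.

def with_field(struct, name, value):
    """Return a copy of the struct (a dict here) with `name` added or replaced."""
    return {**struct, name: value}

def transform(array, fn):
    """Apply `fn` to every element of the array and return a new list."""
    return [fn(element) for element in array]

# One row's "values" column, modelled as a list of dicts
values = [{"Dep": 10, "ABC": "1"}, {"Dep": 10, "ABC": "1"}]

# Same shape as the one-liner above: map over the array, add a field to each struct
new_values = transform(values, lambda x: with_field(x, "newcol", "1"))
```

As in the Spark version, the existing fields keep their order, the new field is appended at the end, and the input array is left unchanged.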
Upvotes: 7