Cribber

Reputation: 2913

Add a column to a struct nested in an array

I have a PySpark DataFrame with an array-of-structs column whose structs contain two fields (Dep and ABC). I want to add a new field, newcol, to each struct.

This question answered "how to add a column to a nested struct", but I'm failing to transfer it to my case, where the struct is further nested inside an array. I can't seem to reference/recreate the array-struct schema.

My schema:

 |-- Id: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Dep: long (nullable = true)
 |    |    |-- ABC: string (nullable = true)

What it should become:

 |-- Id: string (nullable = true)
 |-- values: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Dep: long (nullable = true)
 |    |    |-- ABC: string (nullable = true)
 |    |    |-- newcol: string (nullable = true)

How do I transfer the solution to my nested struct?

Reproducible code to get a df of the above schema:

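from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Values match the declared types: Id and ABC are strings, Dep is a long
data = [
    ("10", [{"Dep": 10, "ABC": "1"}, {"Dep": 10, "ABC": "1"}]),
    ("20", [{"Dep": 20, "ABC": "1"}, {"Dep": 20, "ABC": "1"}]),
    ("30", [{"Dep": 30, "ABC": "1"}, {"Dep": 30, "ABC": "1"}]),
    ("40", [{"Dep": 40, "ABC": "1"}, {"Dep": 40, "ABC": "1"}])
]
myschema = StructType([
    StructField("Id", StringType(), True),
    StructField("values",
                ArrayType(
                    StructType([
                        StructField("Dep", LongType(), True),
                        StructField("ABC", StringType(), True)
                    ])
                ))
])
df = spark.createDataFrame(data=data, schema=myschema)
df.printSchema()
df.show(10, False)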

Upvotes: 3

Views: 3662

Answers (2)

wwnde

Reputation: 26676

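Another way to do it is with a SQL expression: transform rebuilds each struct in the array with the existing fields plus the new one.

from pyspark.sql import functions as F

# Rebuild each struct from its existing fields and append newcol as a string literal
df = df.withColumn("values", F.expr("transform(values, x -> struct(x.Dep, x.ABC, '1' as newcol))"))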

Upvotes: 1

过过招

Reputation: 4189

For Spark >= 3.1, you can use the transform function together with the withField method.

transform applies the given function to each element of the array (here, each struct(Dep, ABC) in the values column). withField adds or replaces a field in a struct column by name.

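from pyspark.sql import functions as F

# F.lit(1) yields an integer field; use F.lit('1') if newcol should be a string
df = df.withColumn('values', F.transform('values', lambda x: x.withField('newcol', F.lit(1))))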

Upvotes: 7
