Venkata

Reputation: 325

How to add a new field to a struct column nested two levels deep

I have a DataFrame with a schema like the one below:

 root
     |-- ts: timestamp (nullable = true)
     |-- address_list: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- id: string (nullable = true)
     |    |    |-- active: integer (nullable = true)
     |    |    |-- address: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- street: string (nullable = true)
     |    |    |    |    |-- city: long (nullable = true)
     |    |    |    |    |-- state: integer (nullable = true)

I would like to add a new field, street_2, to one of its nested columns, address_list.address, between street and city.

The expected schema is:

 root
     |-- ts: timestamp (nullable = true)
     |-- address_list: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- id: string (nullable = true)
     |    |    |-- active: integer (nullable = true)
     |    |    |-- address: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- street: string (nullable = true)
     |    |    |    |    |-- street_2: string (nullable = true)
     |    |    |    |    |-- city: long (nullable = true)
     |    |    |    |    |-- state: integer (nullable = true)

I tried using transform, but that adds the street_2 field at the end of each address_list element instead:

df
.withColumn("address_list", transform(col("address_list"), x => x.withField("street_2", lit(null).cast("string"))))

 root
     |-- ts: timestamp (nullable = true)
     |-- address_list: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- id: string (nullable = true)
     |    |    |-- active: integer (nullable = true)
     |    |    |-- address: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- street: string (nullable = true)
     |    |    |    |    |-- city: long (nullable = true)
     |    |    |    |    |-- state: integer (nullable = true)
     |    |    |-- street_2: string (nullable = true)

However, I want it inside address, inserted between street and city.

Upvotes: 0

Views: 825

Answers (1)

Pradeep yadav

Reputation: 216

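You can try this. withField (Spark 3.1+) accepts a dot-separated path, so it can reach a field inside nested structs:

import org.apache.spark.sql.functions.{col, lit, transform}

// Schema before the change
data.printSchema

// Add age inside person.details for every element of the array
val result = data.withColumn(
  "person_details",
  transform(col("person_details"), x => x.withField("person.details.age", lit(40))))

// Schema after the change
result.printSchema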

Output of the two printSchema calls (before, then after):

root
 |-- person_details: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- person: struct (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- details: struct (nullable = true)
 |    |    |    |    |-- city: string (nullable = true)
 |    |    |    |    |-- income: long (nullable = false)

root
 |-- person_details: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- person: struct (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- details: struct (nullable = true)
 |    |    |    |    |-- city: string (nullable = true)
 |    |    |    |    |-- income: long (nullable = false)
 |    |    |    |    |-- age: integer (nullable = false)

I based this on the following post: https://medium.com/@fqaiser94/manipulating-nested-data-just-got-easier-in-apache-spark-3-1-1-f88bc9003827
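
For the schema in the question, address is itself an array of structs, so a dotted withField path alone will not reach it, and withField appends new fields at the end rather than in a chosen position. One possible approach, shown here only as a sketch, is to nest a second transform and rebuild the inner struct with the fields listed in the desired order. The sample data and the helper name withStreet2 are made up for illustration, and ts is omitted for brevity. Note that rebuilding with struct turns a null address element into a struct of nulls, so this assumes the array has no null elements.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct, transform}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("insert-nested-field")
  .getOrCreate()

// Hypothetical sample row shaped like the question's address_list column.
val df = spark.sql(
  """SELECT array(named_struct(
    |  'id', 'a1',
    |  'active', 1,
    |  'address', array(named_struct(
    |    'street', 'Main St',
    |    'city', CAST(10 AS BIGINT),
    |    'state', 5))
    |)) AS address_list""".stripMargin)

// Hypothetical helper: rebuild one address struct with street_2
// placed between street and city.
def withStreet2(addr: Column): Column = struct(
  addr.getField("street").as("street"),
  lit(null).cast("string").as("street_2"),
  addr.getField("city").as("city"),
  addr.getField("state").as("state"))

// Outer transform walks address_list; withField replaces the existing
// address field, and the inner transform rewrites each address element.
val result = df.withColumn(
  "address_list",
  transform(col("address_list"), outer =>
    outer.withField(
      "address",
      transform(outer.getField("address"), inner => withStreet2(inner)))))

result.printSchema
```

Because withField is given the name of an existing field here (address), it replaces that field rather than appending a new one, so id, active and address stay in their original order.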

Upvotes: 2
