Reputation: 325
I have a DataFrame with the following schema:
root
|-- ts: timestamp (nullable = true)
|-- address_list: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- active: integer (nullable = true)
| | |-- address: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- street: string (nullable = true)
| | | | |-- city: long (nullable = true)
| | | | |-- state: integer (nullable = true)
I would like to add a new field, street_2, to one of its nested columns, address_list.address, between street and city.
Below is the expected schema:
root
|-- ts: timestamp (nullable = true)
|-- address_list: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- active: integer (nullable = true)
| | |-- address: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- street: string (nullable = true)
| | | | |-- street_2: string (nullable = true)
| | | | |-- city: long (nullable = true)
| | | | |-- state: integer (nullable = true)
I tried using transform, but that adds the street_2 field at the end of each address_list element instead:
df
  .withColumn("address_list", transform(col("address_list"), x => x.withField("street_2", lit(null).cast("string"))))
root
|-- ts: timestamp (nullable = true)
|-- address_list: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- active: integer (nullable = true)
| | |-- address: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- street: string (nullable = true)
| | | | |-- city: long (nullable = true)
| | | | |-- state: integer (nullable = true)
| | |-- street_2: string (nullable = true)
whereas I want it inside address, inserted between street and city.
Upvotes: 0
Views: 825
Reputation: 216
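You can try withField with a dotted path inside transform (Spark 3.1 or later):
import org.apache.spark.sql.functions.{col, lit, transform}
// Schema before the change (first output below)
data.printSchema
// Add age to the nested struct person.details of every array element
val result = data.withColumn(
  "person_details",
  transform(col("person_details"), x => x.withField("person.details.age", lit(40))))
// Schema after the change (second output below)
result.printSchema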
root
|-- person_details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- person: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- details: struct (nullable = true)
| | | | |-- city: string (nullable = true)
| | | | |-- income: long (nullable = false)
root
|-- person_details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- person: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- details: struct (nullable = true)
| | | | |-- city: string (nullable = true)
| | | | |-- income: long (nullable = false)
| | | | |-- age: integer (nullable = false)
This is based on the following post: https://medium.com/@fqaiser94/manipulating-nested-data-just-got-easier-in-apache-spark-3-1-1-f88bc9003827
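Note that the dotted path above works because person and details are plain structs. In the question, address is itself an array of structs, and withField appends new fields at the end, so a different shape may be needed. Below is a minimal sketch (not tested against the asker's real data) that nests a second transform and rebuilds each address struct with struct(...) to control the field order. The sample row, the helper name withStreet2, and the local SparkSession settings are assumptions made for illustration; it should be pasteable into spark-shell on Spark 3.1 or later.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct, transform}

// In spark-shell this simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("insert-nested-field")
  .getOrCreate()

// Hypothetical one-row sample with the same schema as in the question.
val df = spark.sql(
  """SELECT current_timestamp() AS ts,
    |       array(named_struct(
    |         'id', 'a1',
    |         'active', 1,
    |         'address', array(named_struct('street', 'Main St', 'city', 10L, 'state', 5))
    |       )) AS address_list""".stripMargin)

// Hypothetical helper: rebuild one address struct field by field,
// placing street_2 between street and city.
def withStreet2(addr: Column): Column = struct(
  addr.getField("street").as("street"),
  lit(null).cast("string").as("street_2"),
  addr.getField("city").as("city"),
  addr.getField("state").as("state"))

// Outer transform walks address_list; withField on an existing field name
// replaces "address" in place, and the inner transform rewrites each address.
val result = df.withColumn(
  "address_list",
  transform(col("address_list"), item =>
    item.withField(
      "address",
      transform(item.getField("address"), addr => withStreet2(addr)))))

result.printSchema()
```

One caveat with this approach: struct(...) always yields a non-null struct, so a null element inside address would come back as a struct whose fields are all null. Listing the fields explicitly also means the helper has to be updated if the address struct gains more fields.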
Upvotes: 2