scalacode

Reputation: 1106

Update nested struct in spark dataset from another struct column

I have the following Spark dataset with a nested struct type:

-- _1: struct (nullable = false)
 |    |-- _1: struct (nullable = false)
 |    |    |-- _1: struct (nullable = false)
 |    |    |    |-- ni_number: string (nullable = true)
 |    |    |    |-- national_registration_number: string (nullable = true)
 |    |    |    |-- id_issuing_country: string (nullable = true)
 |    |    |    |-- doc_type_name: string (nullable = true)
 |    |    |    |-- brand: string (nullable = true)
 |    |    |    |-- company_name: string (nullable = true)
 |    |    |-- _2: struct (nullable = true)
 |    |    |    |-- municipality: string (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |-- _2: struct (nullable = true)
 |    |    |-- brand_name: string (nullable = true)
 |    |    |-- puk: string (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- customer_servicesegment: string (nullable = true)
 |    |-- customer_category: string (nullable = true)

My aim here is to do some flattening at the bottom of the StructType and reach this target schema:

-- _1: struct (nullable = false)
|    |-- _1: struct (nullable = false)
|    |    |-- _1: struct (nullable = false)
|    |    |    |-- ni_number: string (nullable = true)
|    |    |    |-- national_registration_number: string (nullable = true)
|    |    |    |-- id_issuing_country: string (nullable = true)
|    |    |    |-- doc_type_name: string (nullable = true)
|    |    |    |-- brand: string (nullable = true)
|    |    |    |-- company_name: string (nullable = true)
|    |    |-- _2: struct (nullable = true)
|    |    |    |-- municipality: string (nullable = true)
|    |    |    |-- country: string (nullable = true)
|    |-- _2: struct (nullable = true)
|    |    |-- brand_name: string (nullable = true)
|    |    |-- puk: string (nullable = true)
|    |-- _3: struct (nullable = true)
|    |    |-- customer_servicesegment: string (nullable = true)
|    |    |-- customer_category: string (nullable = true)

The part of the schema with the columns (customer_servicesegment, customer_category) should be at the same level as the part with the columns (brand_name, puk).

Perhaps the explode utility from Spark SQL can be used here, but I don't know where to put it.

Any help with this, please?

Upvotes: 1

Views: 445

Answers (1)

blackbishop

Reputation: 32710

If you have Spark 3.1+, you can use the withField column method to update the struct _1 like this:

import org.apache.spark.sql.functions.col

val df2 = df.withColumn("_1", col("_1").withField("_3", col("_2"))).drop("_2")

This adds the top-level column _2 as a new field named _3 inside the struct _1, then drops the original top-level column _2.
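As a quick end-to-end check, the 3.1+ approach can be sketched with a small Dataset of nested tuples (the sample values and the reduced field set are illustrative, not the asker's real data; nested tuples produce the same _1/_2 struct naming as in the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Three levels of nesting, mirroring the question's schema shape
val ds = Seq(
  (((("NI1", "BE"), ("Gent", "BE")), ("BrandX", "9999")), ("consumer", "gold"))
).toDS()

// Move the top-level _2 struct inside _1 as a new field _3
val result = ds.toDF()
  .withColumn("_1", col("_1").withField("_3", col("_2")))
  .drop("_2")

result.printSchema()  // _1 now contains _1, _2, and the appended _3 struct
```

Running printSchema on the result should show _3 nested under _1, alongside the original _1._1 and _1._2.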


For older Spark versions, you need to reconstruct the struct column _1 manually:

import org.apache.spark.sql.functions.{col, struct}

val df2 = df.withColumn(
  "_1",
  struct(col("_1._1").as("_1"), col("_1._2").as("_2"), col("_2").as("_3"))
).drop("_2")

Upvotes: 1
