John

Reputation: 35

Applying a structure-preserving UDF to a column of structs in a dataframe

I have the schema

 |-- count: struct (nullable = true)
 |    |-- day1: long (nullable = true)
 |    |-- day2: long (nullable = true)
 |    |-- day3: long (nullable = true)
 |    |-- day4: long (nullable = true)
 |-- key: string (nullable = true)

and I would like to transform the data so that the structure of count is preserved, i.e., it still has four fields (day1, day2, ...) of type long. The transformation I'd like is to add the value of the day1 field to each of the other fields. My idea was to use a UDF, but I'm not sure 1) how to have the UDF return a struct with the same structure, and 2) how, within the UDF, to access the fields of the struct it's transforming (in order to get the value of day1). The logic of the UDF should be simple, something like

s: StructType => StructType(s.day1, s.day1 + s.day2, s.day1 + s.day3, s.day1 + s.day4)

but I don't know how to get the correct types or preserve the field names of the original structure. I'm very new to Spark, so any guidance is much appreciated.

Also, I would greatly appreciate it if anyone could point me to the right documentation for this type of thing. A transformation like this ought to be straightforward, but after reading the Spark docs it still wasn't clear to me how it's done.

Upvotes: 0

Views: 438

Answers (1)

Alper t. Turker

Reputation: 35229

I wouldn't use a udf for this. Just use select / withColumn with the built-in struct function:

import org.apache.spark.sql.functions._
import spark.implicits._

df.withColumn("count", 
  struct(
    $"count.day1".alias("day1"),
    ($"count.day1" + $"count.day2").alias("day2"), 
    ($"count.day1" + $"count.day3").alias("day3"),
    ($"count.day1" + $"count.day4").alias("day4")))
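If you really do want a udf (say, for logic too involved for column expressions), one way to preserve the struct's shape is to have the UDF return a case class: Spark infers the schema of a returned case class by reflection, so field names and long types come back intact. A minimal sketch, assuming a hypothetical case class Counts and helper addDay1 that mirror your count struct:

```scala
// Hypothetical case class mirroring the count struct; when a UDF
// returns a case class, Spark maps it back to a struct column
// with matching field names and types.
case class Counts(day1: Long, day2: Long, day3: Long, day4: Long)

// Pure logic: keep day1 and add it to each of the other fields.
def addDay1(d1: Long, d2: Long, d3: Long, d4: Long): Counts =
  Counts(d1, d1 + d2, d1 + d3, d1 + d4)
```

You would then wrap it with `val addDay1Udf = udf(addDay1 _)` and pass the struct's fields in individually, since a udf cannot take a whole struct as a typed argument: `df.withColumn("count", addDay1Udf($"count.day1", $"count.day2", $"count.day3", $"count.day4"))`. That said, the column-expression version above is simpler and lets the optimizer see the arithmetic, so prefer it when it covers your logic.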

Upvotes: 1
