My goal is to add a configurable constant value to a given column of a DataFrame.
val df = Seq(("A", 1), ("B", 2), ("C", 3)).toDF("col1", "col2")
+----+----+
|col1|col2|
+----+----+
| A| 1|
| B| 2|
| C| 3|
+----+----+
To do so, I can define a UDF with a hard-coded number, as follows:
val add100 = udf( (x: Int) => x + 100)
df.withColumn("col3", add100($"col2")).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| 1| 101|
| B| 2| 102|
| C| 3| 103|
+----+----+----+
My question is, what's the best way to make the number (100 above) configurable?
I have tried the following, and it seems to work, but I wonder whether there is a better way to achieve the same result:
val addP = udf( (x: Int, p: Int) => x + p )
df.withColumn("col4", addP($"col2", lit(100)))
+----+----+----+
|col1|col2|col4|
+----+----+----+
| A| 1| 101|
| B| 2| 102|
| C| 3| 103|
+----+----+----+
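Stripped of Spark, the body of that UDF is just a binary function; `lit(100)` exists only to lift the plain `Int` into a `Column`, because every argument passed to a UDF must be a `Column`. A minimal plain-Scala sketch of the same logic:

```scala
// The UDF body as a plain binary function; in the question it is
// wrapped with udf(...) and the constant is lifted with lit(100),
// because a UDF's arguments must all be Columns.
val addP: (Int, Int) => Int = (x, p) => x + p

println(addP(1, 100)) // 101
```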
Upvotes: 4
Views: 12878
We don't need a UDF here: `Column` overloads `+`, so the constant (hard-coded or held in a plain Scala variable) can be added directly, and Spark can optimize the whole expression instead of calling into opaque UDF code:
df.withColumn("col3", df("col2") + 100).show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| 1| 101|
| B| 2| 102|
| C| 3| 103|
+----+----+----+
Upvotes: 11
You may define a curried function: pull the extra parameter out, and return a udf that takes only columns as parameters:
val addP = (p: Int) => udf( (x: Int) => x + p )
// addP: Int => org.apache.spark.sql.expressions.UserDefinedFunction = <function1>
df.withColumn("col3", addP(100)($"col2")).show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| 1| 101|
| B| 2| 102|
| C| 3| 103|
+----+----+----+
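The currying trick is plain Scala, independent of Spark. A minimal sketch of the same pattern without the `udf` wrapper:

```scala
// Currying: fixing p yields a one-argument function, just as
// addP(100) above yields a UDF over a single column.
val addP: Int => Int => Int = p => x => x + p

val add100 = addP(100) // partially applied: Int => Int
println(add100(2))     // 102
```

Because `addP(100)` is evaluated on the driver, the resulting UDF closes over a fixed `p`; each configurable value produces its own single-column UDF.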
Upvotes: 12