Reputation: 23
I am trying to define a new column in a Spark DataFrame using a constant defined as a var
. I'm using Zeppelin - in the initial cell, it starts with
%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
spark.read.parquet("<path/to/file>")
The file contains a column named birth_year; I want to create a new column named age
defined as $year - birth_year
, where birth_year
is a string column. I'm not quite clear on how to do this when the input argument to a UDF
is a parameter. I've done a couple hours of searching and created a UDF
, but I got an error message whose principal part is
<console>:71: error: type mismatch;
found : Int
required: org.apache.spark.sql.Column
spark.read.parquet("path/to/file").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge(year, col("birth_year"))).createOrReplaceTempView("tmp")
and a caret directly under 'year'.
I suspect that $year
does not map into a variable of the same length as birth_year
; I've seen the lit()
function that appears to work for strings - does it work with integer values as well, or is there another function for this purpose?
I tried the following:
%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
def createAge = udf((yr : Int, dob : Int) => {yr - dob})
spark.read.parquet("<path/to/file>").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge($"year", col("birth_year"))).createOrReplaceTempView("tmp")
Any suggestions welcome - thanks in advance for any help.
Upvotes: 2
Views: 4767
Reputation: 28392
You can't use year
directly as an input to the UDF
since a it expects columns to operate on. To create a column with a constant value use lit()
. You can call the UDF
as follows:
df.withColumn("age", createAge(lit(year), $"birth_year".cast("int")))
However, when possible it's always preferred to use the in-built functions in Spark when possible. In this case, you do not need an UDF
. Simply do:
df.withColumn("age", lit(year) - $"birth_year".cast("int"))
This should be much faster.
Upvotes: 4