demur
demur

Reputation: 23

Create a new column in a Spark DataFrame using a var with constant value

I am trying to define a new column in a Spark DataFrame using a constant defined as a var. I'm using Zeppelin - in the initial cell, it starts with

%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
spark.read.parquet("<path/to/file>")

The file contains a column named birth_year; I want to create a new column named age defined as $year - birth_year, where birth_year is a string column. I'm not quite clear on how to do this when the input argument to a UDF is a parameter. I've done a couple hours of searching and created a UDF, but I got an error message whose principal part is

<console>:71: error: type mismatch;
 found   : Int
 required: org.apache.spark.sql.Column
       spark.read.parquet("path/to/file").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge(year, col("birth_year"))).createOrReplaceTempView("tmp")

and a caret directly under 'year'.

I suspect that $year does not map into a variable of the same length as birth_year; I've seen the lit() function that appears to work for strings - does it work with integer values as well, or is there another function for this purpose?

I tried the following:

%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
def createAge = udf((yr : Int, dob : Int) => {yr - dob})
spark.read.parquet("<path/to/file>").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge($"year", col("birth_year"))).createOrReplaceTempView("tmp")

Any suggestions welcome - thanks in advance for any help.

Upvotes: 2

Views: 4767

Answers (1)

Shaido
Shaido

Reputation: 28392

You can't use year directly as an input to the UDF since a it expects columns to operate on. To create a column with a constant value use lit(). You can call the UDF as follows:

df.withColumn("age", createAge(lit(year), $"birth_year".cast("int")))

However, when possible it's always preferred to use the in-built functions in Spark when possible. In this case, you do not need an UDF. Simply do:

df.withColumn("age", lit(year) - $"birth_year".cast("int"))

This should be much faster.

Upvotes: 4

Related Questions