Mantis

Reputation: 124

Change Behaviour of Built-in Spark SQL Functions

Is there any way to prevent Spark SQL functions from nulling values?

For example I have the following dataframe

df.show

+--------------------+--------------+------+------------+
|               Title|Year Published|Rating|Length (Min)|
+--------------------+--------------+------+------------+
|      101 Dalmatians|       01/1996|     G|         103|
|101 Dalmatians (A...|          1961|     G|          79|
|101 Dalmations II...|          2003|     G|          70|
+--------------------+--------------+------+------------+

I want to apply Spark SQL's date_format function to the Year Published column.

val sql = """date_format(`Year Published`, 'MM/yyyy')"""

val df2 = df.withColumn("Year Published", expr(sql))

df2.show

+--------------------+--------------+------+------------+
|               Title|Year Published|Rating|Length (Min)|
+--------------------+--------------+------+------------+
|      101 Dalmatians|          null|     G|         103|
|101 Dalmatians (A...|       01/1961|     G|          79|
|101 Dalmations II...|       01/2003|     G|          70|
+--------------------+--------------+------+------------+

The first row of the Year Published column has been nulled because its original value was in a different date format from the other rows.

This behaviour is not unique to date_format; format_number, for example, will null non-numeric values.

My dataset contains mixed date formats and dirty data with unparseable values. If the value of a cell cannot be formatted, I want to keep the current value rather than return null.

Is there a way to make Spark fall back to the original value in df instead of null when the function applied for df2 cannot be evaluated?

What I've tried

I've looked at wrapping Expressions in org.apache.spark.sql.catalyst.expressions but could not see a way to replace the existing functions.

The only working solution I could find is writing my own date_format and registering it as a UDF, but this isn't practical across all functions. I'm looking for a solution that never returns null when a function's input is non-null, or an automated way to wrap all existing Spark functions.
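For reference, the UDF approach I mentioned looks roughly like the following sketch. The name safeDateFormat and the assumed input pattern "yyyy" are illustrative only; a real version would need to try each format present in the data:

```scala
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{col, udf}

// Illustrative wrapper: formats the value as MM/yyyy when it parses,
// and falls back to the original string when parsing fails.
val safeDateFormat = udf { (value: String) =>
  if (value == null) null
  else {
    try {
      val parsed = new SimpleDateFormat("yyyy").parse(value)
      new SimpleDateFormat("MM/yyyy").format(parsed)
    } catch {
      case _: java.text.ParseException => value // keep the original value
    }
  }
}

val df2 = df.withColumn("Year Published", safeDateFormat(col("Year Published")))
```

This works for one function, but writing such a wrapper for every built-in is exactly the duplication I'm trying to avoid.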

Upvotes: 1

Views: 352

Answers (1)

Jack Leow

Reputation: 22497

You could probably use the coalesce function for your purposes:

coalesce(date_format(`Year Published`, 'MM/yyyy'), `Year Published`)
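Applied in the same style as your question, that might look like this sketch. Because coalesce returns its first non-null argument, rows where date_format cannot parse the value fall back to the original string:

```scala
import org.apache.spark.sql.functions.expr

// Rows that date_format nulls out keep their original value.
val sql = "coalesce(date_format(`Year Published`, 'MM/yyyy'), `Year Published`)"
val df2 = df.withColumn("Year Published", expr(sql))
```

This doesn't change the built-in functions themselves, but it gives you the "never null for non-null input" behaviour per expression.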

Upvotes: 1
