Reputation: 557
scala/spark use udf function in spark shell for array manipulation in dataframe column
df.printSchema
root
 |-- x: timestamp (nullable = true)
 |-- date_arr: array (nullable = true)
 |    |-- element: timestamp (containsNull = true)
sample data:
|x | date_arr |
|---------------------- |---------------------------------------------------------------------- |
| 2009-10-22 19:00:00.0 | [2009-08-22 19:00:00.0, 2009-09-19 19:00:00.0, 2009-10-24 19:00:00.0] |
| 2010-10-02 19:00:00.0 | [2010-09-25 19:00:00.0, 2010-10-30 19:00:00.0] |
In udf.jar, I have this function to get the ceiling date in date_arr according to x:
import java.sql.Timestamp
import org.apache.hadoop.hive.ql.exec.UDF

class CeilToDate extends UDF {
  def evaluate(arr: Seq[Timestamp], x: Timestamp): Timestamp = {
    // latest timestamp in arr that falls before x (assumes arr is sorted ascending)
    arr.filter(_.before(x)).last
  }
}
Add the jar to spark-shell: spark-shell --jars udf.jar
In spark-shell, I have a HiveContext as val hc = new HiveContext(spc), and I create the function: hc.sql("create temporary function ceil_to_date as 'com.abc.udf.CeilToDate'")
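(The query below also assumes df has been registered as a temporary table named df; presumably something like this was run beforehand:)

df.registerTempTable("df")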
When I run the query hc.sql("select ceil_to_date(date_arr, x) as ceildate from df").show, I expect a dataframe like:
|ceildate |
|----------------------|
|2009-09-19 19:00:00.0 |
|2010-09-25 19:00:00.0 |
However, it throws this error:
org.apache.spark.sql.AnalysisException: No handler for Hive udf class com.abc.udf.CeilToDate because: No matching method for class com.abc.udf.CeilToDate with (array&lt;timestamp&gt;, timestamp). Possible choices: _FUNC_(struct&lt;&gt;, timestamp)
Upvotes: 0
Views: 2694
Reputation: 41957
Why go through all the complexity of creating a udf jar and including the jar in spark-shell? You can just create a udf in spark-shell and use it on your dataframe.
Given that you have the dataframe:
scala> df.show(false)
+---------------------+---------------------------------------------------------------------+
|x |date_arr |
+---------------------+---------------------------------------------------------------------+
|2009-10-22 19:00:00.0|[2009-08-22 19:00:00.0, 2009-09-19 19:00:00.0, 2009-10-24 19:00:00.0]|
|2010-10-02 19:00:00.0|[2010-09-25 19:00:00.0, 2010-10-30 19:00:00.0] |
+---------------------+---------------------------------------------------------------------+
You can create a udf function in spark-shell, but before that you would need three imports:
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import java.sql.Timestamp
import java.sql.Timestamp
scala> import scala.collection._
import scala.collection._
Then you can create a udf function
scala> def ceil_to_date = udf((arr: mutable.WrappedArray[Timestamp], x: Timestamp) => arr.filter(_.before(x)).last)
ceil_to_date: org.apache.spark.sql.expressions.UserDefinedFunction
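Note that last throws an exception for a row where no element of date_arr is before x, and the filter/last approach assumes date_arr is sorted in ascending order. If that cannot be guaranteed, a defensive variant (a sketch, not part of the original function; ceil_to_date_safe is just an illustrative name) could return null for such rows instead:

def ceil_to_date_safe = udf((arr: mutable.WrappedArray[Timestamp], x: Timestamp) =>
  // latest timestamp before x, or null when no such element exists
  arr.filter(_.before(x)).lastOption.orNull
)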
Your desired output dataframe can be achieved through different methods, but you can simply use select as follows:
scala> df.select(ceil_to_date(col("date_arr"), col("x")).as("ceildate")).show(false)
+---------------------+
|ceildate |
+---------------------+
|2009-09-19 19:00:00.0|
|2010-09-25 19:00:00.0|
+---------------------+
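If you prefer the SQL syntax from the question, you can register the same logic with the HiveContext instead (a sketch, assuming the hc and df values from the question; the temp table name df is illustrative):

hc.udf.register("ceil_to_date", (arr: Seq[Timestamp], x: Timestamp) => arr.filter(_.before(x)).last)
df.registerTempTable("df")
hc.sql("select ceil_to_date(date_arr, x) as ceildate from df").show(false)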
I hope the answer is helpful.
Upvotes: 1