dexter80

Reputation: 75

Higher Order functions in Spark SQL

Can anyone please explain transform() and filter() in Spark SQL 2.4 with some advanced real-world use-case examples?

In a SQL query, can these only be used with array columns, or can they be applied to any column type in general? It would be great if someone could demonstrate this with a SQL query for an advanced application.

Thanks in advance.

Upvotes: 3

Views: 2828

Answers (1)

Ged

Reputation: 18098

I'm not going down the .filter road, as I cannot see the focus there.

For .transform there are three cases to consider:

  • dataframe transform at DF-level
  • transform on an array of a DF in v 2.4
  • transform on an array of a DF in v 3

Taking each in turn:

dataframe transform

According to the Databricks knowledge base article https://kb.databricks.com/data/chained-transformations.html, chaining DF transformations without .transform can end up like spaghetti. Opinions can differ here.

This they say is messy:

...
def inc(i: Int) = i + 1

// Without .transform, every step needs its own intermediate val:
val tmp0 = func0(inc, 3)(testDf)
val tmp1 = func1(1)(tmp0)
val tmp2 = func2(2)(tmp1)
val res = tmp2.withColumn("col3", expr("col2 + 3"))

compared to:

val res = testDf.transform(func0(inc, 4))
                .transform(func1(1))
                .transform(func2(2))
                .withColumn("col3", expr("col2 + 3"))
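For context, both snippets assume a starting DataFrame testDf and curried helpers func0, func1, func2 that each return a DataFrame => DataFrame. The Databricks page defines its own versions; a minimal sketch with hypothetical bodies could look like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical helpers, consistent with the snippets above: each takes its
// parameters in a first argument list and the DataFrame in a second, so the
// partially applied result can be passed straight to .transform.
def func0(f: Int => Int, n: Int)(df: DataFrame): DataFrame =
  df.withColumn("col0", lit(f(n)))

def func1(n: Int)(df: DataFrame): DataFrame =
  df.withColumn("col1", expr(s"col0 + $n"))

def func2(n: Int)(df: DataFrame): DataFrame =
  df.withColumn("col2", expr(s"col1 + $n"))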

transform with a lambda function on an array column of a DF in v2.4, which needs the select and expr combination

import org.apache.spark.sql.functions._
import spark.implicits._   // for toDF; already in scope in the spark-shell

// Each row holds an array of two-element arrays; keep the second element of each.
val df = Seq(Seq(Array(1, 999), Array(2, 9999)),
             Seq(Array(10, 888), Array(20, 8888))).toDF("c1")
val df2 = df.select(expr("transform(c1, x -> x[1])").as("last_vals"))
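Since the question asks about SQL queries specifically, the same higher-order function also works in a plain SQL statement in 2.4. A minimal sketch (the temp view name tbl is my own choice for illustration):

df.createOrReplaceTempView("tbl")
// transform(...) with a lambda is valid Spark SQL syntax from 2.4 onwards
val df3 = spark.sql("SELECT transform(c1, x -> x[1]) AS last_vals FROM tbl")
df3.show(false)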

transform with a lambda function via the new array function API on a DF in v3, using withColumn

import org.apache.spark.sql.functions._
import org.apache.spark.sql._

// Append " is fun!" to every element of the array column, using the
// Column-based transform added to org.apache.spark.sql.functions in v3.
val df = Seq(
             Array("New York", "Seattle"),
             Array("Barcelona", "Bangalore")
            ).toDF("cities")
val df2 = df.withColumn("fun_cities", transform(col("cities"),
                        (city: Column) => concat(city, lit(" is fun!"))))
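As a quick sanity check, df2.show(false) should print something along these lines (column widths approximate):

df2.show(false)
// +----------------------+--------------------------------------+
// |cities                |fun_cities                            |
// +----------------------+--------------------------------------+
// |[New York, Seattle]   |[New York is fun!, Seattle is fun!]   |
// |[Barcelona, Bangalore]|[Barcelona is fun!, Bangalore is fun!]|
// +----------------------+--------------------------------------+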

Try them.

Final note and excellent point raised (from https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/):

transform works similar to the map function in Scala. I’m not sure why they chose to name this function transform… I think array_map would have been a better name, especially because the Dataset#transform function is commonly used to chain DataFrame transformations.

Update

If you want to use the %sql or display approach for higher-order functions, then consult this: https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html

Upvotes: 2
