UtkarshSahu

Reputation: 93

PySpark - pass a value from another column as the parameter of a Spark function

I have a Spark dataframe that looks like this, where expr is a SQL/Hive filter expression:

+-------------------------+---------+-----+
|expr                     |var1     |var2 |
+-------------------------+---------+-----+
|var1 > 7                 |9        |0    |
|var1 > 7                 |9        |0    |
|var1 > 7                 |9        |0    |
|var1 > 7                 |9        |0    |
|var1 = 3 AND var2 >= 0   |9        |0    |
|var1 = 3 AND var2 >= 0   |9        |0    |
|var1 = 3 AND var2 >= 0   |9        |0    |
|var1 = 3 AND var2 >= 0   |9        |0    |
|var1 = 2 AND var2 >= 0   |9        |0    |
+-------------------------+---------+-----+

I want to transform it into the dataframe below, where flag is the boolean value obtained by evaluating the expression in the 'expr' column:

+-------------------------+---------+-----+---------+
|expr                     |var1     |var2 |flag     |
+-------------------------+---------+-----+---------+
|var1 > 7                 |9        |0    |  True   |
|var1 > 7                 |9        |0    |  True   |
|var1 > 7                 |9        |0    |  True   |
|var1 > 7                 |9        |0    |  True   |
|var1 = 3 AND var2 >= 0   |9        |0    |  False  |
|var1 = 3 AND var2 >= 0   |9        |0    |  False  |
|var1 = 3 AND var2 >= 0   |9        |0    |  False  |
|var1 = 3 AND var2 >= 0   |9        |0    |  False  |
|var1 = 2 AND var2 >= 0   |9        |0    |  False  |
+-------------------------+---------+-----+---------+

I have tried using the expr function like this:

df.withColumn('flag', expr(col('expr')))

It fails as expected, because the expr function expects a string as its parameter, not a Column.

Another idea I had was to create a UDF and pass the 'expr' column's value to it, but that won't let me use PySpark's expr function, since UDFs run plain Python code rather than Spark code.
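
For illustration, here's a minimal sketch of that distinction, using the df above:

from pyspark.sql.functions import col, expr

# Works: expr() parses a literal SQL string into a Column.
df.withColumn('flag', expr('var1 > 7'))

# Fails: each row carries its own expression string, so there is no
# single string available on the driver to hand to expr().
# df.withColumn('flag', expr(col('expr')))  # raises an error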

What should my approach be? Any suggestions would be appreciated.

Upvotes: 7

Views: 9498

Answers (2)

Topde

Reputation: 581

Here's a PySpark solution without a UDF. In Scala, I believe you could use map or foldLeft with the same logic.

from pyspark.sql.functions import col, expr, lit, when

# Collect the distinct expression strings as a list
# (collect()[0][0] would grab only the first one).
exprs = [row['expr'] for row in df.select('expr').distinct().collect()]

# Evaluate each expression on its own rows, keeping earlier results.
df = df.withColumn('test', lit(None).cast('boolean'))
for ex in exprs:
    df = df.withColumn('test', when(col('expr') == lit(ex), expr(ex)).otherwise(col('test')))

df.show()
+--------------------+----+----+-----+
|                expr|var1|var2| test|
+--------------------+----+----+-----+
|            var1 > 7|   9|   0| true|
|            var1 > 7|   9|   0| true|
|            var1 > 7|   9|   0| true|
|            var1 > 7|   9|   0| true|
|var1 = 3 AND var2...|   9|   0|false|
|var1 = 3 AND var2...|   9|   0|false|
|var1 = 3 AND var2...|   9|   0|false|
|var1 = 3 AND var2...|   9|   0|false|
|var1 = 2 AND var2...|   9|   0|false|
+--------------------+----+----+-----+

I should point out that I don't understand why the OP wants to do this; with better context on the problem, I'm sure there's a better way.

Iterating like this isn't usually the most efficient thing to do, but here the loop runs over the distinct expressions rather than over the data, so Spark folds everything into a single plan and executes it very fast. The single collect() only added about 2 seconds to the execution time on a 20+ million row DF.
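
If you want to verify the single-plan claim, a quick check (the exact plan output varies by Spark version):

# The chained withColumn/when calls are collapsed by Catalyst into a
# single projection over the scan, so the data is only passed over once.
df.explain()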


UPDATE:

I understand the problem a bit better now. This version should be faster, as Spark calculates all of the filters at once before coalescing them into one column.

from pyspark.sql.functions import coalesce, col, expr, lit, when

# Tip: perform the collect on the smaller DF that contains the distinct filter expressions
exprs = [row['expr'] for row in df.select('expr').distinct().collect()]

df = df.withColumn(
    'filter',
    coalesce(*[when(col('expr') == lit(ex), expr(ex)) for ex in exprs])
)
df.show()
+--------------------+----+----+------+
|                expr|var1|var2|filter|
+--------------------+----+----+------+
|            var1 > 7|   9|   0|  true|
|            var1 > 7|   9|   0|  true|
|            var1 > 7|   9|   0|  true|
|            var1 > 7|   9|   0|  true|
|var1 = 3 AND var2...|   9|   0| false|
|var1 = 3 AND var2...|   9|   0| false|
|var1 = 3 AND var2...|   9|   0| false|
|var1 = 3 AND var2...|   9|   0| false|
|var1 = 2 AND var2...|   9|   0| false|
+--------------------+----+----+------+
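
For anyone who wants to reproduce this end to end, here's a self-contained sketch (a local session and the question's sample data are assumed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, expr, lit, when

spark = SparkSession.builder.master('local').appName('expr-eval').getOrCreate()

rows = ([('var1 > 7', 9, 0)] * 4
        + [('var1 = 3 AND var2 >= 0', 9, 0)] * 4
        + [('var1 = 2 AND var2 >= 0', 9, 0)])
df = spark.createDataFrame(rows, ['expr', 'var1', 'var2'])

exprs = [row['expr'] for row in df.select('expr').distinct().collect()]
df = df.withColumn(
    'filter',
    coalesce(*[when(col('expr') == lit(ex), expr(ex)) for ex in exprs])
)
df.show()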

Upvotes: 5

mvasyliv

Reputation: 1214

Without a UDF:

// Collect the distinct expression strings, then build one filtered
// DataFrame per expression, flag it, and union the pieces back together.
val exprs5 = sourceDF.select('expr).distinct().as[String].collect()
val d1 = exprs5.map(i => {
  val df = sourceDF.filter('expr.equalTo(i))
  df.withColumn("flag", expr(i))
})
val d2 = d1.reduce(_ union _)
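
The same filter-and-union idea in PySpark, since the question is about Python (a sketch assuming the df from the question; functools.reduce plays the role of Scala's reduce):

from functools import reduce
from pyspark.sql.functions import col, expr, lit

exprs = [row['expr'] for row in df.select('expr').distinct().collect()]

# One slice per distinct expression, each evaluated with its own flag,
# then union the slices back into a single DataFrame.
parts = [df.filter(col('expr') == lit(ex)).withColumn('flag', expr(ex))
         for ex in exprs]
result = reduce(lambda a, b: a.union(b), parts)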

With a UDF:

package spark

import org.apache.spark.sql.{DataFrame, SparkSession}

object Filter extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  val sourceDF = Seq(("var1 > 7", 9, 0),
  ("var1 > 7", 9, 0),
  ("var1 > 7", 9, 0),
  ("var1 > 7", 9, 0),
  ("var1 = 3 AND var2 >= 0", 9, 0),
  ("var1 = 3 AND var2 >= 0", 9, 0),
  ("var1 = 3 AND var2 >= 0", 9, 0),
  ("var1 = 3 AND var2 >= 0", 9, 0),
  ("var1 = 2 AND var2 >= 0", 9, 0)).toDF("expr", "var1","var2")

  import org.apache.spark.sql.functions._

  // Hand-rolled parser: it recognises exactly the two expression shapes
  // present in the sample data; anything else returns None (null in the DF).
  val fCheck = udf((expr: String, val1: Int, val2: Int) => {
    expr.split(" ") match {
      case Array(vr, z, vl) if (vr == "var1" && z == ">") => Some(val1 > vl.toInt)
      case Array(vr1, z1, vl1, logic1, vr2, z2, vl2)
        if (vr1 == "var1") && (z1 == "=") && (logic1 == "AND") && (vr2 == "var2") && (z2 == ">=")
      => Some((val1 == vl1.toInt) && (val2 >= vl2.toInt))
      case _ => None
    }
  })

  // fCheck already returns a Column, so no lit() wrapper is needed.
  val resultDF = sourceDF.withColumn("flag", fCheck('expr, 'var1, 'var2))

  resultDF.show(false)
//  +----------------------+----+----+-----+
//  |expr                  |var1|var2|flag |
//  +----------------------+----+----+-----+
//  |var1 > 7              |9   |0   |true |
//  |var1 > 7              |9   |0   |true |
//  |var1 > 7              |9   |0   |true |
//  |var1 > 7              |9   |0   |true |
//  |var1 = 3 AND var2 >= 0|9   |0   |false|
//  |var1 = 3 AND var2 >= 0|9   |0   |false|
//  |var1 = 3 AND var2 >= 0|9   |0   |false|
//  |var1 = 3 AND var2 >= 0|9   |0   |false|
//  |var1 = 2 AND var2 >= 0|9   |0   |false|
//  +----------------------+----+----+-----+

}

Upvotes: 0
