Darshan Manek

Reputation: 155

Pass multiple conditions as a string in where clause in Spark

I am writing the following code in Spark, with the DataFrame API.

val cond = "col("firstValue") >= 0.5 & col("secondValue") >= 0.5 & col("thirdValue") >= 0.5"
val Output1 = InputDF.where(cond)

I am passing all the conditions as a string from an external argument, but it throws a parse error, as cond needs to be of type Column.

For example:

col("firstValue") >= 0.5 & col("secondValue") >= 0.5 & col("thirdValue") >= 0.5

As I want to pass multiple conditions dynamically, how can I convert a String to a Column?

Edit

Is there any way to read a list of conditions from an external source as a Column? I have not found anything that converts a String to a Column in Scala.

Upvotes: 1

Views: 12333

Answers (2)

UmeshThakur

Reputation: 1

I was trying to achieve a similar thing, and for Scala the code below worked for me.

import org.apache.spark.sql.functions.col

// Combine the three predicates with the Column API's logical AND (&&)
val cond = (col("firstValue") >= 0.5 &&
            col("secondValue") >= 0.5 &&
            col("thirdValue") >= 0.5)

val Output1 = InputDF.where(cond)
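If the condition really does have to arrive as a single string from an external argument, as in the question, a minimal sketch using org.apache.spark.sql.functions.expr, which parses a SQL expression string into a Column, could look like this:

import org.apache.spark.sql.functions.expr

// Hypothetical: the condition arrives as plain SQL text from an external argument
val condStr = "firstValue >= 0.5 AND secondValue >= 0.5 AND thirdValue >= 0.5"

// expr parses the SQL expression string into a Column at runtime
val Output1 = InputDF.where(expr(condStr))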

Upvotes: 0

stefanobaghino
stefanobaghino

Reputation: 12804

I believe you may want to do something like the following:

InputDF.where("firstValue >= 0.5 and secondValue >= 0.5 and thirdValue >= 0.5")

The error you are facing is a parse error at runtime; if it had been caused by passing a value of the wrong type, the code would not even have compiled.

As you can see in the official documentation (here provided for Spark 2.3.0), the where method can take either a Column (as in your latter snippet) or a string representing a SQL predicate (as in my example).

The SQL predicate will be interpreted by Spark. However, I believe it's worth mentioning that you may be interested in composing Columns instead of concatenating strings, as the former approach reduces the error surface by eliminating entire classes of possible errors (parse errors, for example).

You can achieve the same with the following code:

InputDF.where(col("firstValue") >= 0.5 and col("secondValue") >= 0.5 and col("thirdValue") >= 0.5)

or more concisely:

import spark.implicits._ // necessary for the $"" notation
InputDF.where($"firstValue" >= 0.5 and $"secondValue" >= 0.5 and $"thirdValue" >= 0.5)

Columns are easily composable, and more robustly so than raw strings. If you have a set of conditions to apply, you can easily and them together in a function that can be verified before you even run the program:

def allSatisfied(condition: Column, conditions: Column*): Column =
    conditions.foldLeft(condition)(_ and _)

InputDF.where(allSatisfied($"firstValue" >= 0.5, $"secondValue" >= 0.5, $"thirdValue" >= 0.5))
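If the conditions themselves come from outside the program as data rather than code, the same Column-based composition still applies. A minimal sketch, assuming (hypothetically) that each condition arrives as a column name paired with a minimum value:

// Hypothetical external input: (column name, minimum value) pairs
val thresholds = Seq(("firstValue", 0.5), ("secondValue", 0.5), ("thirdValue", 0.5))

// Turn each pair into a Column predicate, then combine them with allSatisfied
// (assumes thresholds is non-empty)
val conds = thresholds.map { case (name, min) => col(name) >= min }
InputDF.where(allSatisfied(conds.head, conds.tail: _*))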

You can achieve the same with strings of course, but this would end up being less robust:

def allSatisfied(condition: String, conditions: String*): String =
    conditions.foldLeft(condition)(_ + " and " + _)

InputDF.where(allSatisfied("firstValue >= 0.5", "secondValue >= 0.5", "thirdValue >= 0.5"))
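For completeness, here is a self-contained sketch with made-up data that can be pasted into spark-shell to see the Column-based version end to end:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("where-example").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical input data
val InputDF = Seq(
  (0.6, 0.7, 0.8),
  (0.4, 0.9, 0.9)
).toDF("firstValue", "secondValue", "thirdValue")

val Output1 = InputDF.where(
  col("firstValue") >= 0.5 and col("secondValue") >= 0.5 and col("thirdValue") >= 0.5
)

Output1.show() // keeps only the first row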

Upvotes: 5
