Reputation: 155
I am writing the following code in Spark, with the DataFrame API.
val cond = "col("firstValue") >= 0.5 & col("secondValue") >= 0.5 & col("thirdValue") >= 0.5"
val Output1 = InputDF.where(cond)
I am passing all conditions as strings from external arguments but it throws a parse error as cond
should be of type Column
.
For example:
col("firstValue") >= 0.5 & col("secondValue") >= 0.5 & col("thirdValue") >= 0.5
As I want to pass multiple conditions dynamically, how can I convert a String
to a Column
?
Is there anything through which I can read list of condition externally as Column
, because I have not found anything to convert a String
to a Column
using Scala code.
Upvotes: 1
Views: 12333
Reputation: 1
I was trying to achieve similar thing and for Scala the below code worked for me.
import org.apache.spark.sql.functions.{col, _}
val cond = (col("firstValue") >= 0.5 &
col("secondValue") >= 0.5 &
col("thirdValue") >= 0.5)
val Output1 = InputDF.where(cond)
Upvotes: 0
Reputation: 12804
I believe you may want to do something like the following:
InputDF.where("firstValue >= 0.5 and secondValue >= 0.5 and thirdValue >= 0.5")
The error you are facing is a parse error at runtime, if the error was caused by a wrong type passed in it would not even have compiled.
As you can see in the official documentation (here provided for Spark 2.3.0) the where
method can either take a sequence of Column
s (like in your latter snippet) or a string representing a SQL predicate (as in my example).
The SQL predicate will be interpreted by Spark. However I believe it's worth mentioning that you may be interested in composing your Column
s instead of concatenating strings, as the former approach minimizes the error surface by getting rid of entire classes of possible errors (for example parse errors).
You can achieve the same with the following code:
InputDF.where(col("firstValue") >= 0.5 and col("secondValue") >= 0.5 and col("thirdValue") >= 0.5)
or more concisely:
import spark.implicits._ // necessary for the $"" notation
InputDF.where($"firstValue" >= 0.5 and $"secondValue" >= 0.5 and $"thirdValue" >= 0.5)
Column
s are easily composable and more robustly so than raw strings. If you want a set of conditions to apply you can easily and
them together in a function that can be verified even before you even run the program:
def allSatisfied(condition: Column, conditions: Column*): Column =
conditions.foldLeft(condition)(_ and _)
InputDF.where(allSatisfied($"firstValue" >= 0.5, $"secondValue" >= 0.5, $"thirdValue" >= 0.5))
You can achieve the same with strings of course, but this would end up being less robust:
def allSatisfied(condition: String, conditions: String*): String =
conditions.foldLeft(condition)(_ + " and " + _)
InputDF.where(allSatisfied("firstValue >= 0.5", "secondValue >= 0.5", "thirdValue >= 0.5"))
Upvotes: 5