Reputation: 21
I have to filter a column in a Spark DataFrame using an Array[String].
I have a parameter file like the one below:
variable1=100,200
I read the parameter file, split each row by "=", and load it into a Map[String,String]. To get the value, I pass the key "variable1" and split the value by ",":
val value1: Array[String] = parameterValues("variable1").split(",")
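For completeness, this is roughly how parameterValues is built (the file name here is just a placeholder):
import scala.io.Source

// Split on the first '=' only, in case a value itself contains '='
val parameterValues: Map[String, String] =
  Source.fromFile("parameters.txt").getLines()
    .map(_.split("=", 2))
    .collect { case Array(key, value) => key.trim -> value.trim }
    .toMap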
Now I need to use this value1 when filtering a DataFrame:
val finalDf = testDf.filter("column1 in ($value1) and column2 in ($value1)")
I'm getting the error below:
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '(' expecting <EOF>(line 1, pos 12)
== SQL ==
column1 in ([Ljava.lang.String;@760b9444) and column2 in ([Ljava.lang.String;@5397d41e)
------------^^^
Any suggestions?
Upvotes: 1
Views: 2430
Reputation: 5078
You can filter a column using an array as you've attempted; you just need to fix two things in your SQL expression.
First, you forgot the 's' string interpolator at the start of the string holding your SQL expression, as below:
s"column1 in ($value1) and column2 in ($value1)"
Then, you need to convert your Array[String] into a well-formatted String that Spark will parse as an SQL list. To do so, you can use the mkString method on your value1 array:
value1.mkString("'", "','", "'")
On the array Array("100", "200"), this method returns the string "'100','200'".
Putting everything together, we get the following expression:
val finalDf = testDf.filter(s"column1 in (${value1.mkString("'", "','", "'")}) and column2 in (${value1.mkString("'", "','", "'")})")
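Since the same quoted list is used twice, you may prefer to build it once (the name inList is arbitrary):
val inList = value1.mkString("'", "','", "'")
val finalDf = testDf.filter(s"column1 in ($inList) and column2 in ($inList)")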
Upvotes: 1
Reputation: 5078
To filter a column by an array, you can use the isin column method:
import org.apache.spark.sql.functions.col
val finalDf = testDf.filter(col("column1").isin(value1: _*) && col("column2").isin(value1: _*))
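For illustration, here is a self-contained sketch (the local SparkSession and sample data are made up for the example):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Invented sample data
val testDf = Seq(("100", "200"), ("300", "100"), ("200", "100")).toDF("column1", "column2")
val value1 = Array("100", "200")

// Keeps only the rows where both columns match one of the values in value1
val finalDf = testDf.filter(col("column1").isin(value1: _*) && col("column2").isin(value1: _*))
finalDf.show()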
Upvotes: 1