murali krishna

Reputation: 21

How to filter a column in a Spark dataframe using an Array of strings?

I have to filter a column in a Spark dataframe using an Array[String].

I have a parameter file like the one below,

variable1=100,200

I read the parameter file, split each row by "=", and load it into a Map[String,String]. To get a value, I pass the key "variable1" and split the value by ",":

val value1: Array[String] = parameterValues("variable1").split(",")
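
For reference, the map is built roughly like this; a minimal sketch assuming the file is read with scala.io.Source (the file name "parameters.txt" is illustrative):

import scala.io.Source

// split each row on the first "=" and collect key/value pairs
val parameterValues: Map[String, String] =
  Source.fromFile("parameters.txt").getLines()
    .map(_.split("=", 2))
    .collect { case Array(key, value) => key.trim -> value.trim }
    .toMap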

Now I need to use this value1 while filtering a dataframe.

val finalDf = testDf.filter("column1 in ($value1) and column2 in ($value1)")

I'm getting the error below:

org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '(' expecting <EOF>(line 1, pos 12)

== SQL ==
column1 in ([Ljava.lang.String;@760b9444) and column2 in ([Ljava.lang.String;@5397d41e)
------------^^^

Any suggestions?

Upvotes: 1

Views: 2430

Answers (2)

Vincent Doba

Reputation: 5078

You can filter a column using an array the way you're doing it. To correct your SQL expression, you need to do two things.

First, you forgot to put the 's' string interpolator at the start of the string holding your SQL expression, as below:

s"column1 in ($value1) and column2 in ($value1)"

Then, you need to convert your Array[String] into a well-formatted String that will be understood as a SQL value list. To do so, you can use the mkString method on your value1 array:

value1.mkString("'", "','", "'")

On your array Array("100", "200"), this method returns the string "'100','200'".

Wrapping everything up, we get the following expression:

val finalDf = testDf.filter(s"column1 in (${value1.mkString("'", "','", "'")}) and column2 in (${value1.mkString("'", "','", "'")})")
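
Here is a self-contained sketch of the whole approach; the sample dataframe and values are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// toy data: keep rows where both columns are in {100, 200}
val testDf = Seq(("100", "200"), ("300", "100")).toDF("column1", "column2")
val value1 = Array("100", "200")

// build the quoted, comma-separated list once and reuse it
val inList = value1.mkString("'", "','", "'")  // "'100','200'"
val finalDf = testDf.filter(s"column1 in ($inList) and column2 in ($inList)")
finalDf.show()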

Upvotes: 1

Vincent Doba

Reputation: 5078

To filter a column by an array, you can use the isin column method:

import org.apache.spark.sql.functions.col

val finalDf = testDf.filter(col("column1").isin(value1: _*) && col("column2").isin(value1: _*))
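
Note that value1: _* expands the array into the varargs that isin expects. Since this builds the condition through the Column API instead of parsing a SQL string, there is no need to quote or escape the values yourself.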

Upvotes: 1
