zzztimbo
zzztimbo

Reputation: 2353

How do I filter rows based on whether a column value is in a Set of Strings in a Spark DataFrame

Is there a more elegant way of filtering based on values in a Set of String?

def myFilter(actions: Set[String], myDF: DataFrame): DataFrame = {
  val containsAction = udf((action: String) => {
    actions.contains(action)
  })

  myDF.filter(containsAction('action))
}

In SQL you can do

select * from myTable where action in ('action1', 'action2', 'action3')

Upvotes: 13

Views: 21414

Answers (1)

Justin Pihony
Justin Pihony

Reputation: 67075

How about this:

myDF.filter("action in (1,2)")

OR

import org.apache.spark.sql.functions.lit       
myDF.where($"action".in(Seq(1,2).map(lit(_)):_*))

OR

import org.apache.spark.sql.functions.lit       
myDF.where($"action".in(Seq(lit(1),lit(2)):_*))

Additional support will be added to make this cleaner in 1.5

Upvotes: 25

Related Questions