Georg Heiler
Georg Heiler

Reputation: 17676

spark filter string column by contains one of list of strings

I need to achieve something similar to: Checking if values in List is part of String in spark. I.e. there is a dataframe of:

abcd_some long strings
goo bar baz

and an Array of desired words like ["some", "bar"].

An UDF with this code would work just fine, however, I would like to have something more efficient. Is there a way to express the FILTER my_col CONTAINS ONE OF [items] using the SQL DSL? Perhaps by dynamically constructing a REGEX?

NOTE: it is not an exat match rather a regular 'CONTAINS' / LIKE '%thing%'. I.e. not an exact match. Otherwise the isIn operator would work.

edit

probably generating some SQL code dynamically is the most efficient way.

def orFilterGeneratorMultiContains(filterPredicates:Seq[String], column:String):Column = {

    col(column).contains(filterPredicates(0)) or col(column).contains(filterPredicates(1)) // TODO iterate
  }
  def filterToDesiredApps(filterPredicates:Seq[String], column:String)(df:DataFrame):DataFrame={
      df.filter(orFilterGeneratorMultiContains(filterPredicates, column))
  }

So is need to still figure out how to properly iterate the expression.

edit 2

However, this turns out to be a bit tricky:

import org.apache.spark.sql.functions.col

val column = col("foo")
val interstingTHings = Seq("bar", "baz", "thing3")

interstingTHings.foldLeft(column) { (filteredOrColumnExpression, predicateItem) =>
  // TODO how to properly nest the OR operator?
  // filteredOrColumnExpression.contains(predicateItem) // generates: Contains(Contains(Contains('foo, bar), baz), thing3)
  filteredOrColumnExpression or filteredOrColumnExpression.contains(predicateItem) // generates: ((('foo || Contains('foo, bar)) || Contains(('foo || Contains('foo, bar)), baz)) || Contains((('foo || Contains('foo, bar)) || Contains(('foo || Contains('foo, bar)), baz)), thing3)) 
  //     TODO but what y really would need is:
  //      col(column).contains("bar") or col(column).contains("baz") or col(column).contains("thing3")
}.explain(true)

as it does not generate the correct OR nested filter conditions.

Upvotes: 1

Views: 6728

Answers (2)

Prasanna V
Prasanna V

Reputation: 31

Wouldn't rlike work in this case?

df.filter(col("foo").rlike(interestingThings.mkString("|"))

Upvotes: 3

Bi Rico
Bi Rico

Reputation: 25823

You have the right idea, but I think you want to use || not or. Something like:

def orFilterGeneratorMultiContains(filterPredicates:Seq[String], column:String): Column = {
  val coi = col(column)
  filterPredicates.map(coi.contains).reduce(_ || _)
}

Upvotes: 2

Related Questions