great223

Reputation: 1

Dataframe filter out rows (string) which contain specified words

Here is my question, in Scala. I have a Spark DataFrame with one column whose data type is string. For example, the column has 4 values:

"Abc", "dfea", "skjod", "aaa"

And a list:

val list = List("ab", "kj")

I need to filter out the rows whose value contains any word in the list. So, for the data above, I should be left with the second and the fourth rows.

Here is my code:

val del_blk = (arg: String) => {
  for (word <- list) {
    if (arg.contains(word)) 1
  }
  0
} 
val blkUDF = udf(del_blk)
df
.withColumn("blk", blkUDF(col("col")))
.filter(col("blk") === 0)
.select("col")
.show()

Upvotes: 0

Views: 314

Answers (2)

Nick

Reputation: 679

It's better not to use UDFs in Spark, as they are opaque to the optimiser and tend to be slow. You can do this with Spark SQL's built-in column functions instead. Note that isin tests exact membership, not substrings, so for substring matching you can OR together contains predicates:

df.withColumn("blk", list.map(w => col("col").contains(w)).reduce(_ || _))
  .filter(col("blk") === false)

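To see how exact membership differs from substring matching, here is a quick plain-Scala sketch using the sample data from the question (no Spark needed for the comparison itself):

```scala
val list = List("ab", "kj")
val values = List("Abc", "dfea", "skjod", "aaa")

// Exact membership: none of the values equals "ab" or "kj" outright
val exactMatches = values.filter(v => list.contains(v))
// → List()

// Substring matching: "skjod" contains "kj"
// (note this is case-sensitive, so "Abc" does not match "ab")
val substringMatches = values.filter(v => list.exists(w => v.contains(w)))
// → List("skjod")
```

If "Abc" should also be dropped, you would need to normalise case before matching.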
Upvotes: 0

Martijn

Reputation: 12102

val del_blk = (arg: String) => {
  for (word <- list) {
    if (arg.contains(word)) 1
  }
  0
}

is equivalent to

val del_blk = (arg: String) => {
  list.foreach(word => if (arg.contains(word)) 1)
  0
}

I suspect you intended something more like

def containsForbiddenWord(word: String): Boolean =
  list.exists(forbidden => word.contains(forbidden))
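Applied to the sample data from the question, this keeps the second and fourth rows. The toLowerCase step below is an assumption on my part: the question expects "Abc" to match "ab", which only works if matching is case-insensitive.

```scala
val list = List("ab", "kj")

def containsForbiddenWord(word: String): Boolean =
  list.exists(forbidden => word.contains(forbidden))

val values = List("Abc", "dfea", "skjod", "aaa")

// Keep only the rows that contain no forbidden word
// (lower-casing first, assuming case-insensitive matching)
val kept = values.filterNot(v => containsForbiddenWord(v.toLowerCase))
// → List("dfea", "aaa")
```

In Spark you would wrap containsForbiddenWord in a udf and filter on its negation, though as the other answer notes, built-in column functions are preferable when they fit.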

Upvotes: 1
