Reputation: 1
Here is my question, in Scala. I have a DataFrame with one column of type string. For example, the column has four values:
"Abc", "dfea", "skjod", "aaa"
and a list:
val list = List("ab", "kj")
I need to filter out the rows that contain any value from the list; for the data above, that leaves the second and fourth rows ("dfea" and "aaa").
Here is my code:
val del_blk = (arg: String) => {
  for (word <- list) {
    if (arg.contains(word)) 1
  }
  0
}
val blkUDF = udf(del_blk)
df
  .withColumn("blk", blkUDF(col("col")))
  .filter(col("blk") === 0)
  .select("col")
  .show()
Upvotes: 0
Views: 314
Reputation: 679
It's better not to use UDFs in Spark, as they are opaque to the optimiser and therefore slow; built-in column functions are preferable. Note that isin only tests exact equality against the list values, so for the substring matching asked about here you can combine contains conditions instead:
df.withColumn("blk", list.map(col("col").contains(_)).reduce(_ || _))
  .filter(col("blk") === false)
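For reference, a minimal self-contained sketch of this approach, assuming a local SparkSession; the lower-casing is an assumption on my part, since the expected output in the question ("Abc" matching "ab") implies case-insensitive matching:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("Abc", "dfea", "skjod", "aaa").toDF("col")
val list = List("ab", "kj")

// Build one combined condition: true when the lower-cased value
// contains any of the listed substrings.
val containsAny = list.map(w => lower(col("col")).contains(w)).reduce(_ || _)

// Keep only the rows that match none of the substrings.
df.filter(!containsAny).select("col").show()
// +----+
// | col|
// +----+
// |dfea|
// | aaa|
// +----+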
Upvotes: 0
Reputation: 12102
val del_blk = (arg: String) => {
  for (word <- list) {
    if (arg.contains(word)) 1
  }
  0
}
is equivalent to
val del_blk = (arg: String) => {
  list.foreach(word => if (arg.contains(word)) 1)
  0
}
Both versions discard the 1 produced inside the loop, so the lambda always returns 0 and no row is ever filtered out.
I suspect what you actually intend is something like
def containsForbiddenWord(word: String): Boolean =
  list.exists(forbidden => word.contains(forbidden))
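For completeness, a minimal sketch of wiring that predicate back into the question's pipeline, assuming the same df, list, and column name as in the question (note the UDF now returns true for a row that should be dropped):
import org.apache.spark.sql.functions.{col, udf}

val blkUDF = udf(containsForbiddenWord _)

// Keep only the rows whose value contains none of the forbidden words.
df.filter(!blkUDF(col("col")))
  .select("col")
  .show()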
Upvotes: 1