How to pass dataset column value to a function while using spark filter with scala?

Question

I have a DataFrame of actions, each consisting of a user ID and an action type:

+-------+-------+
|user_id|   type|
+-------+-------+
|     11| SEARCH|
|     11| DETAIL|
|     12| SEARCH|
+-------+-------+

I want to keep only the actions that belong to users who have at least one SEARCH action.

So I created a Bloom filter containing the IDs of users who have a SEARCH action.

Then I tried to filter all actions based on whether the Bloom filter might contain each action's user ID:

val df = spark.read...
val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
val bloomFilter = BloomFilter.create(100)
searchers.foreach(bloomFilter.putString(_))
df.filter(bloomFilter.mightContainString($"user_id"))

But the code fails to compile with this error:

type mismatch;
found   : org.apache.spark.sql.ColumnName
required: String

How can I pass the column value to the BloomFilter.mightContainString method?

Chitral Verma · Accepted Answer

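You can do something like this:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.util.sketch.BloomFilter

val sparkSession: SparkSession = ???  // your existing session
val sc = sparkSession.sparkContext
import sparkSession.implicits._  // enables $"col" and .as[String]

val bloomFilter = BloomFilter.create(100)  // sized for about 100 expected items

val df: DataFrame = ???  // the actions DataFrame

val searchers = df.filter($"type" === "SEARCH").select("user_id").distinct.as[String].collect
// Populate the filter with the IDs of users who searched
searchers.foreach(bloomFilter.putString(_))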

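Note that collect is not a good idea here: it pulls every distinct searcher ID onto the driver, which may not scale to a large number of users. Next, you can wrap the lookup in a UDF so that it receives the column value as a String for each row: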

import org.apache.spark.sql.functions.udf
val bbFilter = sc.broadcast(bloomFilter)

val filterUDF = udf((s: String) => bbFilter.value.mightContainString(s))

df.filter(filterUDF($"user_id"))

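The broadcast is optional if the bloomFilter instance is serializable, because the UDF closure can then capture it directly. Broadcasting still has the advantage of shipping the filter to each executor once instead of with every task.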
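
Putting the pieces together, a minimal end-to-end sketch might look like the following. The local SparkSession, the sample rows (including a hypothetical user 13 who never searches), and the names mightBeSearcher, approx and exact are assumptions for illustration, not part of the original code. The sketch also shows a left semi join, which is a different technique from the Bloom filter: it avoids collect entirely and gives an exact result, whereas a Bloom filter can return false positives.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.util.sketch.BloomFilter

object BloomFilterExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("bloom-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; user 13 has no SEARCH action.
    val df = Seq(
      ("11", "SEARCH"),
      ("11", "DETAIL"),
      ("12", "SEARCH"),
      ("13", "DETAIL")
    ).toDF("user_id", "type")

    // Collect the distinct IDs of users with at least one SEARCH action.
    val searchers = df.filter($"type" === "SEARCH")
      .select("user_id").distinct.as[String].collect()

    // Build the Bloom filter on the driver and populate it.
    val bloomFilter = BloomFilter.create(100)
    searchers.foreach(id => bloomFilter.putString(id))

    // Broadcast the filter and wrap the lookup in a UDF, so the
    // function receives each row's column value as a String.
    val bbFilter = spark.sparkContext.broadcast(bloomFilter)
    val mightBeSearcher =
      udf((id: String) => bbFilter.value.mightContainString(id))

    // Approximate result: may include false positives.
    val approx = df.filter(mightBeSearcher($"user_id"))
    approx.show()

    // Exact alternative without collect: keep rows whose user_id
    // appears among the searchers, using a left semi join.
    val searcherIds =
      df.filter($"type" === "SEARCH").select("user_id").distinct
    val exact = df.join(searcherIds, Seq("user_id"), "left_semi")
    exact.show()

    spark.stop()
  }
}
```

The UDF is what resolves the original type mismatch: mightContainString expects a String, while $"user_id" is a Column expression, so the lookup has to run per row inside a function that Spark applies to the column. If the user IDs may be null, the UDF would need a null check before calling mightContainString.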

Hope this helps, Cheers.
