Reputation: 69
I want to filter a dataframe which has a column with categories (List[String]). I want to ignore all the rows that have a non valid category. They are not valid when they are not in model.getCategories
def checkIncomingData(model: Model, incomingData: DataFrame) : DataFrame = {
val list = model.getCategories.toList
sc.broadcast(list)
incomingData.filter(incomingData("categories").isin(list))
}
Unfortunately my approach does not work because categories is a list, not a single element. Any idea who to make it work?
Upvotes: 1
Views: 1966
Reputation: 18022
The first problem I see is that you didn't assign the broadcast to a variable.
val broadcastList = sc.broadcast(list)
Besides you have to reference it using broadcastList.value
. For instance:
incomingData.filter($"categories".isin(broadcastList.value: _*))
NOTE
@LostInOverflow made an important contribution, he clarified my answer and said that the method isin
is actually evaluated in the driver, so broadcasting the list doesn't help at all, and more important the list shall be expanded in order to be evaluated.
Upvotes: 3
Reputation:
Just expand list:
incomingData.filter(incomingData("categories").isin(list: _*))
Note: broadcasting won't help you here. This is evaluated on driver.
Upvotes: 1