pez_betta
pez_betta

Reputation: 69

Filter a DataFrame by Array Column

I want to filter a dataframe which has a column with categories (List[String]). I want to ignore all the rows that have a non valid category. They are not valid when they are not in model.getCategories

def checkIncomingData(model: Model, incomingData: DataFrame) : DataFrame = {
  val list = model.getCategories.toList
  sc.broadcast(list)
  incomingData.filter(incomingData("categories").isin(list))
}

Unfortunately my approach does not work because categories is a list, not a single element. Any idea who to make it work?

Upvotes: 1

Views: 1966

Answers (2)

Alberto Bonsanto
Alberto Bonsanto

Reputation: 18022

The first problem I see is that you didn't assign the broadcast to a variable.

val broadcastList = sc.broadcast(list)

Besides you have to reference it using broadcastList.value. For instance:

incomingData.filter($"categories".isin(broadcastList.value: _*))

NOTE @LostInOverflow made an important contribution, he clarified my answer and said that the method isin is actually evaluated in the driver, so broadcasting the list doesn't help at all, and more important the list shall be expanded in order to be evaluated.

Upvotes: 3

user6022341
user6022341

Reputation:

Just expand list:

incomingData.filter(incomingData("categories").isin(list: _*))

Note: broadcasting won't help you here. This is evaluated on driver.

Upvotes: 1

Related Questions