Reputation: 890
I would like to filter a dataset to only keep the rows of users which occur at least X times in the dataset. In the example below I would only like to keep the rows with "id1" and "id2". How can I achieve this?
case class UserValue(UserId: String, value: Double)
val values = sc.parallelize(Seq(UserValue("id1", 5.0),
Rating("id2", 4.0),
Rating("id2", 3.0),
Rating("id1", 2.0),
Rating("id3", 1.0)))
Upvotes: 0
Views: 64
Reputation: 215107
You can groupBy UserId, filter
based the size of each group and then use flatMap
to transform it back:
values.groupBy(_.UserId).
filter{ case (k, v) => v.size >= 2 }.
flatMap{ case (k, v) => v }.collect
// res72: Array[UserValue] = Array(UserValue(id1,5.0), UserValue(id1,2.0), UserValue(id2,4.0), UserValue(id2,3.0))
Upvotes: 2