Spark - subset dataset based on occurrences of user id

Question

I would like to filter a dataset so that it keeps only the rows of users who occur at least X times in the dataset. In the example below, with X = 2, I would like to keep only the rows with "id1" and "id2". How can I achieve this?

  case class UserValue(UserId: String, value: Double)

  val values = sc.parallelize(Seq(
        UserValue("id1", 5.0),
        UserValue("id2", 4.0),
        UserValue("id2", 3.0),
        UserValue("id1", 2.0),
        UserValue("id3", 1.0)))

akuiper · Accepted Answer

You can groupBy UserId, filter based on the size of each group, and then use flatMap to turn the groups back into individual rows:

values.groupBy(_.UserId).
       filter{ case (k, v) => v.size >= 2 }.
       flatMap{ case (k, v) => v }.collect

// res72: Array[UserValue] = Array(UserValue(id1,5.0), UserValue(id1,2.0), UserValue(id2,4.0), UserValue(id2,3.0))
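
If some users have a very large number of rows, note that groupBy shuffles every row and builds each group in memory. A possible alternative, shown here only as a sketch, is to count rows per user with reduceByKey and then filter the original RDD against the set of qualifying ids. The names keepFrequentUsers, minCount and FilterFrequentUsers are made up for this example, and the sketch assumes that the set of qualifying ids is small enough to collect on the driver and broadcast to the executors.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class UserValue(UserId: String, value: Double)

object FilterFrequentUsers {

  // Hypothetical helper: keep only rows whose UserId occurs at least minCount times.
  def keepFrequentUsers(values: RDD[UserValue], minCount: Int): RDD[UserValue] = {
    // Count rows per user; reduceByKey combines counts map-side before the shuffle.
    val frequentIds: Set[String] = values
      .map(v => (v.UserId, 1L))
      .reduceByKey(_ + _)
      .filter { case (_, n) => n >= minCount }
      .keys
      .collect()
      .toSet

    // Assumption: the id set fits in memory, so it can be broadcast to executors.
    val frequent = values.sparkContext.broadcast(frequentIds)
    values.filter(v => frequent.value.contains(v.UserId))
  }

  def main(args: Array[String]): Unit = {
    // Local session for trying the sketch outside spark-shell.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-frequent-users")
      .getOrCreate()
    val sc = spark.sparkContext

    val values = sc.parallelize(Seq(
      UserValue("id1", 5.0),
      UserValue("id2", 4.0),
      UserValue("id2", 3.0),
      UserValue("id1", 2.0),
      UserValue("id3", 1.0)))

    // Keeps the four rows belonging to id1 and id2, drops the single id3 row.
    keepFrequentUsers(values, minCount = 2).collect().foreach(println)

    spark.stop()
  }
}
```

If the set of qualifying ids is too large to broadcast, joining the per-user counts back onto the data keyed by UserId would be the usual fallback, at the cost of another shuffle.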
