Marsellus Wallace

Reputation: 18601

Generate diff of List[String] in Scalding

I have a records: TypedPipe[(String, util.List[String])] in my Scalding job, where the first value is an id and the second is a list of stuff. Imagine the following:

("1", ["a","b","c"])
("1", ["a","b","c"])
("1", ["a","b","c"])
("2", ["a","b"])
("2", ["a","b","c"])
("3", ["a","b","c"])

After records.groupBy(_._1) I'd like to output only the records that differ from each other for a given id. For the input above the output should be:

("2", ["a","b"])
("2", ["a","b","c"])

I'm new to Scalding. What's an elegant way to achieve this?

Upvotes: 0

Views: 77

Answers (2)

Dima

Reputation: 40510

If the size of values for each key is small enough to fit in memory, then something like this should do it:

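records
  .group
  .toSet                                           // distinct lists per id
  .filter { case (_, values) => values.size > 1 }  // keep ids with more than one distinct list
  .flattenValues                                   // back to one (id, list) row per distinct list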
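To make the intent of the in-memory approach concrete, here is a minimal sketch using plain Scala collections rather than the Scalding API. The names InMemoryDiff and differing are hypothetical. It relies on java.util.List comparing element by element, so identical lists collapse to a single entry in a Set.

```scala
import java.util.{List => JList}

object InMemoryDiff {
  // Hypothetical helper: returns the records of every id that has
  // at least two distinct lists, with exact duplicates collapsed.
  def differing(records: Seq[(String, JList[String])]): Set[(String, JList[String])] =
    records
      .groupBy(_._1)                                         // id -> all records for that id
      .toSeq                                                 // avoid Map semantics in later steps
      .map { case (id, rs) => id -> rs.map(_._2).toSet }     // id -> distinct lists
      .filter { case (_, lists) => lists.size > 1 }          // keep ids whose lists differ
      .flatMap { case (id, lists) => lists.toSeq.map(list => id -> list) }
      .toSet
}
```

The conversion to a Seq before the final flatMap matters in this sketch: a flatMap over a Map that yields pairs would build another Map and keep only one list per id.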

If it is too big, then you can join the pipe with itself:

val grouped = records.group
grouped
 .join(grouped)
 .collect { case(k, (a, b)) if a != b => k -> a }
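One thing to watch with the self-join is that it yields a row for every pair of differing values, so an id with repeated values produces repeated output rows. The sketch below models the join with plain Scala collections (it is not Scalding code, and SelfJoinDiff and differing are hypothetical names) and adds a deduplication step at the end.

```scala
object SelfJoinDiff {
  // Hypothetical helper: pairs every record with every other record of the
  // same id, keeps the left side of each pair whose values differ, and then
  // removes the repeats that the pairing introduces.
  def differing[V](records: Seq[(String, V)]): Seq[(String, V)] = {
    val pairs = for {
      (k1, a) <- records
      (k2, b) <- records
      if k1 == k2 && a != b      // same id, different value
    } yield k1 -> a
    pairs.distinct               // one row per distinct (id, value)
  }
}
```

Without the final distinct, the input ("x", 1), ("x", 1), ("x", 2) would yield four rows instead of two.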

Upvotes: 0

millhouse

Reputation: 10007

I don't know if the Scalding aspect is critical to you (is your collection exceptionally huge?) but in plain-old Scala I'd do:

// Given:
val records = Seq(
  "1" -> List("a", "b", "c"),
  "1" -> List("a", "b", "c"),
  "1" -> List("a", "b", "c"),
  "2" -> List("a", "b"),
  "2" -> List("a", "b", "c"),
  "3" -> List("a", "b", "c"),
  "3" -> List("d")
)

val distinctValues = records.groupBy(_._1).map { case (k, v) => k -> v.toSet }
// => Map(2 -> Set((2,List(a, b)), (2,List(a, b, c))), 1 -> Set((1,List(a, b, c))), 3 -> Set((3,List(a, b, c)), (3,List(d))))

val havingMultipleDistinct = distinctValues.filter { case (_, v) => v.size > 1 }
// => Map(2 -> Set((2,List(a, b)), (2,List(a, b, c))), 3 -> Set((3,List(a, b, c)), (3,List(d))))

val asRecords = havingMultipleDistinct.values.flatten
// => List((2,List(a, b)), (2,List(a, b, c)), (3,List(a, b, c)), (3,List(d)))

Upvotes: 0
