Fabio

Reputation: 617

Scala collect function

Let's say I want to print duplicates in a list with their count. So I have 3 options as shown below:

  def dups(dup: List[Int]) = {
    // 1)
    println(dup.groupBy(identity).collect { case (x, ys) if ys.lengthCompare(1) > 0 => (x, ys.size) }.toSeq)
    // 2)
    println(dup.groupBy(identity).collect { case (x, List(_, _, _*)) => x }.map(x => (x, dup.count(y => x == y))))
    // 3)
    println(dup.distinct.map((a: Int) => (a, dup.count((b: Int) => a == b))).filter((pair: (Int, Int)) => pair._2 > 1))
  }

Questions:

-> For option 2, is there any way to name the list parameter so that it can be used to append the size of the list just like I did in option 1 using ys.size?

-> For option 1, is there any way to avoid the last call to toSeq to return a List?

-> Which of the 3 options is the most efficient, i.e. uses the fewest loops?

For example, the input List(1,1,1,2,3,4,5,5,6,100,101,101,102) should print: List((1,3), (5,2), (101,2))

Based on @lutzh's answer below, the best way would be to do one of the following:

import scala.collection.breakOut // available in Scala 2.12 and earlier
val list: List[(Int, Int)] = dup.groupBy(identity).collect({ case (x, ys @ List(_, _, _*)) => (x, ys.size) })(breakOut)
val list2: List[(Int, Int)] = dup.groupBy(identity).collect { case (x, ys) if ys.lengthCompare(1) > 0 => (x, ys.size) }(breakOut)

Upvotes: 1

Views: 13451

Answers (1)

lutzh

Reputation: 4965

For option 1 is there any way to avoid the last call to toSeq to return a List?

collect takes an implicit CanBuildFrom, so if you assign the result to a value of the desired type you can use breakOut:

import collection.breakOut
val dups: List[(Int,Int)] = 
    dup
    .groupBy(identity)
    .collect({ case (x,ys) if ys.size > 1 => (x,ys.size)} )(breakOut)

collect will create a new collection (just like map), using a Builder. Usually the return type is determined by the origin type. With breakOut you basically ignore the origin type and look for a builder for the result type. So when collect creates the resulting collection, it will already create the "right" type, and you don't have to traverse the result again to convert it.
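
Note that breakOut exists only in Scala 2.12 and earlier; it was removed with the 2.13 collections redesign. As a rough sketch of a variant that should compile on both, one can collect over the map's iterator and call toList, which also avoids building an intermediate Map. The object name DupsNoBreakOut is made up for illustration, and the final sort is there only because groupBy gives no ordering guarantee.

```scala
object DupsNoBreakOut {
  // Hypothetical helper for illustration; not part of the answer above.
  def dups(dup: List[Int]): List[(Int, Int)] =
    dup
      .groupBy(identity) // Map[Int, List[Int]]
      .iterator          // lazy, so collect does not build an intermediate Map
      .collect { case (x, ys) if ys.lengthCompare(1) > 0 => (x, ys.size) }
      .toList            // the List is built once, directly from the iterator
      .sortBy(_._1)      // groupBy does not preserve the input order
}
```

With the example input from the question, DupsNoBreakOut.dups returns the three duplicated values together with their counts, sorted by value.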

For option 2, is there any way to name the list parameter so that it can be used to append the size of the list just like I did in option 1 using ys.size?

Yes, you can bind it to a variable with @:

val dups: List[(Int,Int)] = 
    dup
    .groupBy(identity)
    .collect({ case (x, ys @ List(_, _, _*)) => (x, ys.size) } )(breakOut)
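
To show what the @ binder does in isolation, here is a small sketch. The object AtBinding and the function describe are hypothetical names used only for this example. The pattern List(_, _, _*) matches any list with at least two elements, and ys @ gives the whole matched list a name so it can be used on the right-hand side.

```scala
object AtBinding {
  // Hypothetical function for illustration only.
  def describe(xs: List[Int]): String = xs match {
    // ys is bound to the entire list that matched the pattern
    case ys @ List(_, _, _*) => s"two or more elements (size ${ys.size})"
    case _                   => "fewer than two elements"
  }
}
```

For instance, AtBinding.describe(List(1, 1, 1)) takes the first branch, while a single-element or empty list falls through to the second.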

which one of the 3 choices is more efficient?

Calling dup.count for every match seems inefficient, because dup has to be traversed again each time. I'd avoid that.

My guess would be that the guard (if ys.lengthCompare(1) > 0) takes a few cycles fewer than the List(_, _, _*) pattern, but I haven't measured. And am not planning to.

Disclaimer: There may be a completely different (and more efficient) way of doing it that I can't think of right now. I'm only answering your specific questions.
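
One possible alternative, offered only as an unmeasured sketch: count occurrences in a single pass with foldLeft, then keep the entries whose count is greater than one. This avoids building a list per key as groupBy does. The names DupsSinglePass and dups are assumptions for illustration, and the result is sorted because a Map has no defined order.

```scala
object DupsSinglePass {
  // Hypothetical helper for illustration; not measured against the options above.
  def dups(dup: List[Int]): List[(Int, Int)] = {
    // One traversal of the input: accumulate a count per distinct element.
    val counts = dup.foldLeft(Map.empty[Int, Int]) { (acc, x) =>
      acc.updated(x, acc.getOrElse(x, 0) + 1)
    }
    // One traversal of the count map: keep only values seen more than once.
    counts.iterator.filter(_._2 > 1).toList.sortBy(_._1)
  }
}
```

Whether this is actually faster than the groupBy versions would need to be benchmarked on realistic input.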

Upvotes: 2
