Reputation: 32284
I need to process a large number of records (several million) representing people. I would like to create a partition based on the year-of-birth, and then process each group separately. I am trying to create a functional solution (no/minimal mutable data), so that it will be thread-safe and can be parallelized.
For my first attempt, I created a tail-recursive function that builds a Map[Int, IndexedSeq] that maps each year-of-birth to a sequence of people records. I need an indexed sequence because I will be doing random accesses to the people in each group. Here is my code:
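import scala.annotation.tailrec

@tailrec
def loop(people: Seq[Person],
         map: Map[Int, IndexedSeq[Person]] = Map()): Map[Int, IndexedSeq[Person]] = {
  if (people.isEmpty) map
  else {
    val person = people.head
    val yearOfBirth = person.yearOfBirth
    // Look up the group built so far for this year (empty if none yet)
    val seq = map.getOrElse(yearOfBirth, IndexedSeq())
    // Recurse on the rest with the person appended to that group
    loop(people.tail, map + (yearOfBirth -> (seq :+ person)))
  }
}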
This works, but is not very efficient. I can do better by allowing a small amount of very localized mutability. If all of the mutable variables are on the stack, the code will still be thread-safe, as long as the output Map is immutable.
I would like to implement this by internally building a mutable Map[Int, List[Person]] and then efficiently converting it to an immutable Map[Int, IndexedSeq[Person]] as the return value.
How can I convert the mutable Map of List items to an immutable Map[Int, IndexedSeq[Person]] in the most efficient manner possible? Note that there is no particular order to the people in each year-of-birth group.
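For concreteness, the approach described above could be sketched roughly as follows. This is only an illustration, not tuned or measured code: the Person case class and the name partitionByYear are assumptions made up for the example, and Vector is picked as one possible immutable IndexedSeq. Prepending to a List reverses the order within each group, which is acceptable here since the order does not matter.

```scala
import scala.collection.mutable

// Hypothetical record type, assumed for this sketch only
final case class Person(name: String, yearOfBirth: Int)

def partitionByYear(people: Seq[Person]): Map[Int, IndexedSeq[Person]] = {
  // The mutable map is local to this call and never escapes it
  val groups = mutable.Map.empty[Int, List[Person]]
  for (p <- people)
    groups(p.yearOfBirth) = p :: groups.getOrElse(p.yearOfBirth, Nil)
  // Copy each List into a Vector and build an immutable Map as the result
  groups.iterator.map { case (year, ps) =>
    year -> (ps.toVector: IndexedSeq[Person])
  }.toMap
}
```

Only the returned immutable Map is visible to callers, so the local mutation stays hidden from other threads.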
Upvotes: 1
Views: 762
Reputation: 24769
Why don't you use the groupBy function of the Seq trait? (documentation is here: http://www.scala-lang.org/api/current/index.html#scala.collection.Seq)
def groupByYearOfBirth(people: Seq[Person]) = people.groupBy(_.yearOfBirth)
Edit: contrary to my initial suggestion, don't use .mapValues(_.toIndexedSeq) to provide an IndexedSeq. Daniel explains why in a comment below.
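If an IndexedSeq per group is still needed, one possible alternative is a strict map over the grouped result. In Scala 2.12 and earlier, mapValues returns a lazy view that re-applies the function on every lookup, which may be the concern raised in the comment (that is an assumption on my part). The Person case class below is hypothetical and only there to make the sketch self-contained.

```scala
// Hypothetical record type, assumed for this sketch only
final case class Person(name: String, yearOfBirth: Int)

def groupByYearOfBirth(people: Seq[Person]): Map[Int, IndexedSeq[Person]] =
  people
    .groupBy(_.yearOfBirth)
    // map (unlike mapValues) builds a new strict Map, so each group
    // is converted to an IndexedSeq exactly once
    .map { case (year, group) => year -> group.toIndexedSeq }
```

Unlike the prepend-based approach, groupBy keeps each group's members in their original relative order.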
Upvotes: 6