Performance Difference Using Update Operation on a Mutable Map in Scala with a Large Size Data

I would like to know if an update operation on a mutable map is better in performance than reassignment.

Lets assume I have the following Map

val m=Map(1 -> Set("apple", "banana"),
          2 -> Set("banana", "cabbage"),
          3 -> Set("cabbage", "dumplings"))

which I would like to reverse into this map:

 Map("apple" -> Set(1),
     "banana" -> Set(1, 2),
     "cabbage" -> Set(2, 3),
     "dumplings" -> Set(3))

The code to do so is:

def reverse(m:Map[Int,Set[String]])={
  var rm = Map[String,Set[Int]]()
  m.keySet foreach { k=>
       m(k) foreach { e =>
         rm = rm + (e -> (rm.getOrElse(e, Set()) + k))
       }
  }
  rm
}

Would it be more efficient to use the update operator on a map if it is very large in size?

The code using the update on map is as follows:

def reverse(m:Map[Int,Set[String]])={
  var rm = scala.collection.mutable.Map[String,Set[Int]]()
  m.keySet foreach { k=>
      m(k) foreach { e =>
         rm.update(e,(rm.getOrElse(e, Set()) + k))                                                        
      }
  }
  rm
}

Upvotes: 1

Answers (3)

user-asterix

Reputation: 936

Just using @Aivean method in a functional way:

def transform(mp :Map[Int, Set[String]]) = {
   val accum = mp.values.flatten
                 .toSet.map( (_-> scala.collection.mutable.Set[Int]())).toMap
   mp.map {case(k,vals) => vals.map( v => accum(v)+=k)}
   accum.mapValues(_.toSet)
}

Upvotes: 0

jwvh

Reputation: 51271

I ran some tests using Rex Kerr's Thyme utility.

First I created some test data.

val rndm = new util.Random
val dna = Seq('A','C','G','T')
val m = (1 to 4000).map(_ -> Set(rndm.shuffle(dna).mkString
                                ,rndm.shuffle(dna).mkString)).toMap

Then I timed some runs with both the immutable.Map and mutable.Map versions. Here's an example result:

Time:    2.417 ms   95% CI 2.337 ms - 2.498 ms   (n=19)  // immutable
Time:    1.618 ms   95% CI 1.579 ms - 1.657 ms   (n=19)  // mutable
Time     2.278 ms   95% CI 2.238 ms - 2.319 ms   (n=19)  // functional version

As you can see, using a mutable Map with update() has a significant performance advantage.

Just for fun I also compared these results with a more functional version of a Map reverse (or what I call a Map inverter). No var or any mutable type involved.

m.flatten{case(k, vs) => vs.map((_, k))}
 .groupBy(_._1)
 .mapValues(_.map(_._2).toSet)

This version consistently beat your immutable version but still doesn't come close to the mutable timings.

Upvotes: 3

Aivean

Reputation: 10882

The trade-of between mutable and immutable collections usually narrows down to this:

immutable collections are safer to share and allows to use structural sharing
mutable collections have better performance

Some time ago I did comparison of performance between mutable and immutable Maps in Scala and the difference was about 2 to 3 times in favor of mutable ones.

So, when performance is not critical I usually go with immutable collections for safety and readability.

For example, in your case functional "scala way" of performing this transformation would be something like this:

m.view
 .flatMap(x => x._2.map(_ -> x._1))  // flatten map to lazy view of String->Int pairs
 .groupBy(_._1)                      // group pairs by String part
 .mapValues(_.map(_._2).toSet)       // extract all Int parts into Set

Although I used lazy view to avoid creating intermediate collections, groupBy still internally creates mutable map (you may want to check it's sources, the logic is pretty similar to what you have wrote), which in turn gets converted to immutable Map which then gets discarded by mapValues.

Now, if you want to squeeze every bit of performance you want to use mutable collections and do as little updates of immutable collections as possible.

For your case is means having Map of mutable Sets as you intermediate buffer:

def transform(m:Map[Int, Set[String]]):Map[String, Set[Int]] = {
  val accum:Map[String, mutable.Set[Int]] = 
    m.valuesIterator.flatten.map(_ -> mutable.Set[Int]()).toMap

  for ((k, vals) <- m; v <- vals) {
    accum(v) += k
  }
  accum.mapValues(_.toSet)
}

Note, I'm not updating accum once it's created: I'm doing exactly one map lookup and one set update for each value, while in both your examples there was additional map update.

I believe this code is reasonably optimal performance wise. I didn't perform any tests myself, but I highly encourage you to do that on your real data and post results here.

Also, if you want to go even further, you might want to try mutable BitSet instead of Set[Int]. If ints in your data are fairly small it might yield some minor performance increase.

Upvotes: 1

Performance Difference Using Update Operation on a Mutable Map in Scala with a Large Size Data

Answers (3)

Related Questions