ypriverol

Reputation: 605

Scala immutable Map slow

I have a piece of code where I create a map like this:

 val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap

Then I use this map to create my object:

case class MyObject(val attribute1: String, val attribute2: Map[String, String])

I'm reading millions of lines and converting them to MyObjects using an iterator, like this:

MyObject("1", map)

When I do this, it is really slow: more than 1 hour for 2,000,000 entries.

If I remove the map from the object creation but still do the split step (section 1):

val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap
MyObject("1", null)

then the script runs in less than 1 minute for the 2,000,000 entries.

I did some profiling, and it looks like the slowdown happens when the object is created: assigning the val map to the object's map field is what makes the process slow. What am I missing?

Update to better explain the problem:

My code iterates over 2,000,000 lines, converting each line to an internal object. To iterate, I do:

it.map(createNewObject).toList

This iterator goes through all the lines and converts them to my objects using the function createNewObject.

This part is actually really fast, especially with a large heap, as dk14 said. The performance problem is inside my

`createNewObject(line: String)`

This function creates an object of type

`class MyObject(val attribute1:String, val attribute2:Map[String, String])` 

The function takes the line and first does

`val attributeArr = line.split("\t")` 

The first element of the array is attribute1 of my object, and the second attribute is

`val map = attributeArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap` 

If I only print the number of elements in map, the program finishes in 2 minutes. If I pass map to my new object, as in MyObject(attribute1, map), the program is really slow.
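Put together, the function described above might look like the following minimal sketch. The `GtfParsing` object name is hypothetical, and the `.trim` on keys is an addition that is not in the original code (without it, keys keep the spaces left over from splitting on `;` and `"`).

```scala
// Hypothetical, self-contained version of the parsing step described above.
case class MyObject(attribute1: String, attribute2: Map[String, String])

object GtfParsing {
  // Parses one tab-separated line; the field at index 8 holds `key "value";` pairs.
  def createNewObject(line: String): MyObject = {
    val attributeArr = line.split("\t")
    val map = attributeArr(8)
      .split(";")                 // one chunk per `key "value"` pair
      .map(_.split("\""))         // Array(key-with-spaces, value)
      .collect { case Array(k, v) => (k.trim, v) } // trim is an assumption, see lead-in
      .toMap
    MyObject(attributeArr(0), map)
  }
}
```

Calling `it.map(GtfParsing.createNewObject).toList` then keeps one such `Map` alive per line for as long as the resulting list is referenced.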

Upvotes: 4

Views: 1817

Answers (1)

dk14

Reputation: 22374

`(0 to 2000000).toList` and `(0 to 2000000).map(x => x -> x).toMap` have similar performance if you give them enough memory (I tried `-Xmx4G`, i.e. 4 gigabytes). The `toMap` implementation does a lot of cloning, so a lot of memory is allocated and deallocated. Under memory starvation, the GC therefore becomes overactive.

When I ran `(0 to 2000000).toList` with 128 MB of heap, it took several seconds, but `(0 to 2000000).map(x => x -> x).toMap` ran for at least 2 minutes with 10% GC activity (according to VisualVM) and then died with an out-of-memory error.

However, when I tried `-Xmx4G`, both were pretty fast.
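A rough way to repeat this comparison is sketched below. The `ToMapTiming` object and the `timed` helper are hypothetical names, and a single wall-clock measurement like this is only indicative (no JIT warm-up, no GC accounting). Printing `Runtime.getRuntime.maxMemory` shows which heap limit the JVM actually picked up.

```scala
object ToMapTiming {
  // Runs a block once and returns its result together with the elapsed milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val n = 2000000
    // Reports the heap limit in effect (set with -Xmx).
    println(s"max heap: ${Runtime.getRuntime.maxMemory / (1024 * 1024)} MB")

    val (list, listMs) = timed((0 to n).toList)
    println(s"toList: ${list.size} elements in $listMs ms")

    val (map, mapMs) = timed((0 to n).map(x => x -> x).toMap)
    println(s"toMap:  ${map.size} entries in $mapMs ms")
  }
}
```

Running it once with a small `-Xmx` value and once with a large one should make the difference in GC pressure visible in the timings.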


P.S. What `toMap` does is repeatedly add an element to a prefix tree, so it has to clone (`Array.copy`) a lot for every element: https://github.com/scala/scala/blob/99a82be91cbb85239f70508f6695c6b21fd3558c/src/library/scala/collection/immutable/HashMap.scala#L321.

So `toMap` repeatedly (2000000 times) calls `updated0`, which in turn does an `Array.copy` pretty often. That requires lots of memory allocations, which (in the low-memory case) causes the GC to spend most of its time in MarkAndSweep (slow garbage collection), as far as I can see from jconsole.
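The effect of repeated immutable updates can be illustrated with a small sketch. This is not the library's `toMap` code, only a conceptual stand-in that uses the public `updated` method; `PersistentUpdate` and `buildByFold` are hypothetical names, and newer Scala versions may build maps differently internally.

```scala
object PersistentUpdate {
  // Builds a map one element at a time; every step returns a new Map value
  // and leaves the previous one untouched (and eligible for garbage collection).
  def buildByFold(n: Int): Map[Int, Int] =
    (0 until n).foldLeft(Map.empty[Int, Int]) { (acc, x) => acc.updated(x, x) }

  def main(args: Array[String]): Unit = {
    val m1 = Map(1 -> 1)
    val m2 = m1.updated(2, 2) // m1 is unchanged; m2 is a new map
    println(s"m1 size = ${m1.size}, m2 size = ${m2.size}")
    println(s"folded size = ${buildByFold(1000).size}")
  }
}
```

Each intermediate map becomes garbage as soon as the next one is built, which is where the allocation churn described above comes from.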


Solution: either increase the memory (`-Xmx`/`-Xms` JVM parameters) or, if you need more complex operations on your data set, use something like Apache Spark (or any batch-oriented map-reduce framework) to process your data in a distributed way.

Upvotes: 4
