Reputation: 605
I have a piece of code where I create a map like this:
val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap
Then I use this map to create my object:
case class MyObject(val attribute1: String, val attribute2: Map[String, String])
I'm reading millions of lines and converting each one to a MyObject using an iterator, like this:
MyObject("1", map)
When I do this, it is really slow: more than 1 h for 2'000'000 entries.
If I remove the map from the object creation but still do the split (section 1):
val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap
MyObject("1", null)
the script runs in less than 1 min for the 2'000'000 entries.
I did some profiling, and it looks like the assignment of the val map
to the object's map at object creation is what makes the process slow. What am I missing?
Update to better explain the problem:
My code iterates over 2'000'000 lines, converting each line to an internal object. To iterate I do:
it.map(createNewObject).toList
This iterator goes through all the lines and converts them to my objects using the function createNewObject.
This is actually really fast, especially with a large heap, as dk14 said. The performance problem is inside my
`createNewObject(line: String)`
This function creates an object of
`class MyObject(val attribute1:String, val attribute2:Map[String, String])`
The function takes the line and first does
`val attributeArr = line.split("\t")`
The first element of the array is the attribute1 of my object, and the second attribute is
`val map = attributeArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap`
If I only print the number of elements in map, the program ends in 2 min; if I pass map to my new object, as in MyObject(attribute1, map),
the program is really slow.
Upvotes: 4
Views: 1817
Reputation: 22374
(0 to 2000000).toList and (0 to 2000000).map(x => x -> x).toMap have similar performance if you give them enough memory (I tried -Xmx4G, i.e. 4 gigabytes). The toMap implementation involves a lot of cloning, so a lot of memory is being allocated and deallocated. With too little memory, the GC becomes overactive.
When I tried to run (0 to 2000000).toList with 128 MB, it took several seconds, but (0 to 2000000).map(x => x -> x).toMap took at least 2 minutes with 10% GC activity (VisualVM) and then died with an out-of-memory error.
However, when I tried -Xmx4G, both were pretty fast.
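The comparison above can be reproduced with a small timing harness. This is only a rough sketch: the object name `ToListVsToMap` and the helpers `timed`, `buildList` and `buildMap` are made up for illustration, and a single wall-clock measurement without JIT warm-up is not a proper benchmark. Run it with different -Xmx values to see the effect of heap size.

```scala
object ToListVsToMap {
  // Hypothetical helper: evaluates `body` once and returns its result
  // together with the elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // The two expressions compared in the answer, parameterised by n.
  def buildList(n: Int): List[Int] = (0 to n).toList
  def buildMap(n: Int): Map[Int, Int] = (0 to n).map(x => x -> x).toMap

  def main(args: Array[String]): Unit = {
    val n = 2000000
    val (list, listMs) = timed(buildList(n))
    val (map, mapMs) = timed(buildMap(n))
    println(s"toList: ${list.size} elements in $listMs ms")
    println(s"toMap:  ${map.size} entries in $mapMs ms")
  }
}
```

The absolute numbers will vary by machine and Scala version; what matters is how the toMap time changes between a small and a large heap.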
P.S. What toMap does is repeatedly add an element to a prefix tree, so it has to clone (Array.copy) a lot for every element: https://github.com/scala/scala/blob/99a82be91cbb85239f70508f6695c6b21fd3558c/src/library/scala/collection/immutable/HashMap.scala#L321.
So toMap calls updated0 repeatedly (2000000 times), which in turn does an Array.copy pretty often. That requires lots of memory allocations, which (in the low-memory case) causes the GC to run MarkAndSweep (slow garbage collection) most of the time (as far as I can see from jconsole).
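To illustrate why every added element allocates, here is a minimal sketch of the idea. It assumes that building the map behaves roughly like folding over the pairs and calling `updated` on an immutable map each time, which matches the linked Scala version only conceptually (newer standard library versions may use a different builder internally). The names `PersistentMapSketch` and `buildByFolding` are hypothetical.

```scala
object PersistentMapSketch {
  // Adds one pair at a time. Each `updated` returns a NEW immutable map,
  // so parts of the underlying structure are copied on every step.
  def buildByFolding(pairs: Seq[(Int, Int)]): Map[Int, Int] =
    pairs.foldLeft(Map.empty[Int, Int]) { case (acc, (k, v)) => acc.updated(k, v) }

  def main(args: Array[String]): Unit = {
    val m1 = Map(1 -> 1)
    val m2 = m1.updated(2, 2)
    // m1 is left untouched; m2 is a separate map with the extra entry.
    println(s"m1 size: ${m1.size}, m2 size: ${m2.size}")

    val pairs = (0 to 1000).map(x => x -> x)
    // Both ways of building the map give equal results.
    println(buildByFolding(pairs) == pairs.toMap)
  }
}
```

The intermediate maps become garbage immediately, which is cheap with a large heap but keeps the collector busy when memory is tight.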
Solution: Either increase the memory (-Xmx/-Xms JVM parameters) or, if you need more complex operations on your data set, use something like Apache Spark (or any batch-oriented map-reduce framework) to process your data in a distributed way.
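When changing -Xmx, it can help to confirm that the running JVM actually picked up the setting. The following is a small sketch using the standard `java.lang.Runtime` API; the object name `HeapReport` and the helper `toMb` are hypothetical.

```scala
object HeapReport {
  // Converts a byte count to whole megabytes.
  def toMb(bytes: Long): Long = bytes / (1024L * 1024L)

  def main(args: Array[String]): Unit = {
    val rt = Runtime.getRuntime
    // maxMemory reflects the upper heap limit (what -Xmx controls).
    println(s"max heap:  ${toMb(rt.maxMemory())} MB")
    // totalMemory - freeMemory approximates the heap currently in use.
    println(s"used heap: ${toMb(rt.totalMemory() - rt.freeMemory())} MB")
  }
}
```

If the reported maximum is much smaller than expected, the flag is probably not reaching the JVM that runs the job (for example when a build tool forks a separate process).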
Upvotes: 4