FUD

Reputation: 5184

Why do Scala parallel collections sometimes cause an OutOfMemoryError?

This takes around 1 second

(1 to 1000000).map(_+3)

While this gives java.lang.OutOfMemoryError: Java heap space

(1 to 1000000).par.map(_+3)

EDIT:

I have a standard Scala 2.9.2 configuration and I am typing this at the Scala prompt. In the launcher script I can see [ -n "$JAVA_OPTS" ] || JAVA_OPTS="-Xmx256M -Xms32M", and I don't have JAVA_OPTS set in my environment.

1 million integers ≈ 8 MB; creating the collection twice ≈ 16 MB.

Upvotes: 5

Views: 994

Answers (4)

piotr

Reputation: 5787

I had the same problem, but using a ThreadPool seems to get rid of it for me:

  import java.util.concurrent.{Executors, ThreadPoolExecutor}
  import scala.collection.parallel.ThreadPoolTaskSupport

  val threadPool = Executors.newFixedThreadPool(4)
  val quadsMinPar = quadsMin.par
  quadsMinPar.tasksupport = new ThreadPoolTaskSupport(threadPool.asInstanceOf[ThreadPoolExecutor])

ForkJoin for large collections might be creating too many threads.

Upvotes: 0

axel22

Reputation: 32335

Several reasons for the failure:

  1. Parallel collections are not specialized, so the objects get boxed. This means that you can't multiply the number of elements with 8 to get the memory usage.
  2. Using map means that the range is converted into a vector. For parallel vectors an efficient concatenation has not been implemented yet, so merging intermediate vectors produced by different processors proceeds by copying - requiring more memory. This will be addressed in future releases.
  3. The REPL stores previous results - the object evaluated in each line remains in memory.
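Point 2 can be observed directly in the REPL. A minimal sketch (assuming a pre-2.13 Scala, where .par is available without the separate parallel-collections module):

```scala
import scala.collection.parallel.immutable.{ParRange, ParVector}

val r = (1 to 10).par   // a ParRange: constant space, no elements materialized
val v = r.map(_ + 3)    // map materializes a ParVector: one boxed object per element
assert(r.isInstanceOf[ParRange])
assert(v.isInstanceOf[ParVector[_]])
```

So even though the input range is cheap, the mapped result is a fully built (and boxed) ParVector.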

Upvotes: 3

Matthew Farwell

Reputation: 61695

There are two issues here: the amount of memory required to store a parallel collection, and the amount of memory required to 'pass through' one.

The difference can be seen between these two lines:

(1 to 1000000).map(_+3).toList
(1 to 1000000).par.map(_+3).toList

Remember that the REPL stores the evaluated expressions. On my REPL, I can execute both of these 7 times before I run out of memory. Passing through the parallel execution uses extra memory temporarily, but once toList has executed, that extra usage can be garbage collected.

(1 to 100000).par.map(_+3)

returns a ParSeq[Int] (in this case a ParVector), which takes up more space than a normal Vector. This one I can execute 4 times before I run out of memory, whereas I can execute this:

(1 to 100000).map(_+3)

11 times before I run out of memory. So parallel collections, if you keep them around, take up more space.

As a workaround, you can transform them into simpler collections like a List before you return them.
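A sketch of that workaround (the helper name is made up for illustration): do the work in parallel, but return a plain List so the larger parallel structure becomes garbage as soon as the call returns.

```scala
// Hypothetical helper: parallel computation, sequential result
def mapPlusThree(n: Int): List[Int] =
  (1 to n).par.map(_ + 3).toList

val small = mapPlusThree(5)  // List(4, 5, 6, 7, 8)
```

Only the final List is retained by the caller; the intermediate ParVector is eligible for collection immediately.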

As for why so much space is taken up by parallel collections and why it keeps references to so many things, I don't know, but I suspect views[*], and if you think it's a problem, raise an issue for it.

[*] without any real evidence.

Upvotes: 2

Nicolas

Reputation: 24759

It definitely seems related to the JVM memory options and to the memory required to store a parallel collection. For example:

scala> (1 to 1000000).par.map(_+3)

ends up with an OutOfMemoryError the third time I try to evaluate it, while

scala> (1 to 1000000).par.map(_+3).seq

never failed. The issue is not the computation; it's the storage of the parallel collection.
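In other words, calling .seq converts the result back to a sequential collection before anything is retained; a minimal sketch:

```scala
val p = (1 to 1000).par.map(_ + 3)  // ParVector: kept alive if stored (e.g. by the REPL)
val s = p.seq                        // back to a sequential Vector
assert(s == (4 to 1003))             // same elements, plain collection
```

If only s is stored, the heavier parallel structure behind p can be garbage collected.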

Upvotes: 9
