Reputation: 353
I'm just starting to learn Scala, coming from Python. I was attempting a basic file processing task in Scala: removing substrings like "[ ... ]" from data files using a regex. The script successfully processes the first few files and then throws a java.lang.OutOfMemoryError: Java heap space error. The data file at which the error occurs is about 70 MB, and I have 16 GB of RAM at my disposal. (The preceding 6 files are each smaller than 100 KB, except the first one, which is 5.5 MB.)
My question is: what causes the OutOfMemoryError, and how can I change my approach to prevent it? I don't understand why it happens. I have little experience debugging memory errors, as Python is relatively forgiving about memory management.
Any additional comments on coding style or the methods I use are more than welcome - I am eager to learn.
Regexer.scala:
import scala.io.Source
import java.io._

object Regexer {
  def main(args: Array[String]): Unit = {
    val filenames = Source.fromFile("all_files.txt").getLines()
    for (fn <- filenames) {
      val datafile: String = Source.fromFile(fn).mkString
      val new_data: String = datafile.replaceAll(raw"\[.*?\]", "")
      val file = new File(fn)
      val bw = new BufferedWriter(new FileWriter(file))
      bw.write(new_data)
      bw.close()
    }
  }
}
all_files.txt is a file containing paths to all files to process (as they are located in subdirectories).
Finally, the complete error message thrown upon execution:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuilder.append(StringBuilder.java:190)
at scala.collection.mutable.StringBuilder.appendAll(StringBuilder.scala:249)
at scala.io.BufferedSource.mkString(BufferedSource.scala:97)
at Regexer$$anonfun$main$1.apply(Regexer.scala:12)
at Regexer$$anonfun$main$1.apply(Regexer.scala:10)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at Regexer$.main(Regexer.scala:10)
at Regexer.main(Regexer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.reflect.internal.util.ScalaClassLoader$$anonfun$run$1.apply(ScalaClassLoader.scala:70)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:101)
at scala.reflect.internal.util.ScalaClassLoader$class.run(ScalaClassLoader.scala:70)
at scala.reflect.internal.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:101)
at scala.tools.nsc.CommonRunner$class.run(ObjectRunner.scala:22)
at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:39)
at scala.tools.nsc.CommonRunner$class.runAndCatch(ObjectRunner.scala:29)
at scala.tools.nsc.ObjectRunner$.runAndCatch(ObjectRunner.scala:39)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:65)
at scala.tools.nsc.MainGenericRunner.run$1(MainGenericRunner.scala:87)
at scala.tools.nsc.MainGenericRunner.process(MainGenericRunner.scala:98)
at scala.tools.nsc.MainGenericRunner$.main(MainGenericRunner.scala:103)
at scala.tools.nsc.MainGenericRunner.main(MainGenericRunner.scala)
Upvotes: 2
Views: 7388
Reputation: 46
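To add to puhlen's answer, you can read a file line by line with:
import scala.io.Source
// getLines() returns a lazy iterator, so only one line is held in memory at a time
for (line <- Source.fromFile("myfile.txt").getLines())
  println(line)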
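One detail worth adding: the Source keeps the file handle open until it is closed, which matters when looping over many files. A minimal sketch of reading lazily and closing afterwards (the object and method names here are made up for illustration):

```scala
import scala.io.Source

object LineCount {
  // Counts lines without loading the whole file; the handle is closed
  // even if reading throws.
  def countLines(path: String): Long = {
    val source = Source.fromFile(path)
    try source.getLines().foldLeft(0L)((n, _) => n + 1)
    finally source.close()
  }
}
```

The same try/finally shape applies to any per-line work, not just counting.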
Upvotes: 1
Reputation: 8529
You might have 16 GiB on your computer, but that doesn't mean the JVM can use all of it. Scala code (normally) runs in the Java Virtual Machine (JVM), which has its own memory limit. The default amount of memory available might be too low for your program. The maximum heap size for your process can be set with the -Xmx option. Try something like java -Xmx1024m Regexer or java -Xmx2g Regexer, or however much memory you think should work. If you still get the error after raising the limit beyond what processing the files should need, then you either have a memory leak, or your algorithm needs to be optimized.
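To see how much heap the JVM is actually allowed to use, one option is to ask the runtime directly. The sketch below is only illustrative (the object name HeapInfo is made up); it prints the limit that -Xmx controls. If the program is started through the scala runner rather than java, as the stack trace in the question suggests, the option would typically be passed as scala -J-Xmx2g ..., since -J forwards flags to the underlying JVM.

```scala
object HeapInfo {
  // Maximum heap the JVM will attempt to use, in MiB.
  def maxHeapMiB: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"Max heap available to this JVM: $maxHeapMiB MiB")
}
```

Running it once with and once without the -Xmx setting should show whether the flag is actually reaching the JVM.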
In your specific case, instead of loading the entire file into memory, consider processing it line by line, or in some other fixed-size buffer, so that at any time you only need to keep a small portion of the file in memory.
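A rough sketch of what line-by-line processing could look like for the task in the question. The names LineRegexer, stripBrackets and processFile are made up for illustration. It assumes the bracketed parts never span a line break, which the original pattern also assumes, since . does not match line terminators by default. Because the input cannot safely be overwritten while it is still being read, the output goes to a temporary file in the same directory and replaces the original at the end.

```scala
import java.io.{BufferedWriter, File, FileWriter}
import java.nio.file.{Files, StandardCopyOption}
import scala.io.Source

object LineRegexer {
  // Same pattern as in the question: non-greedy match of "[ ... ]" within a line.
  private val Bracketed = raw"\[.*?\]".r

  def stripBrackets(line: String): String = Bracketed.replaceAllIn(line, "")

  def processFile(path: String): Unit = {
    val in = new File(path)
    // Temporary output file next to the input (an illustrative choice).
    val tmp = File.createTempFile("regexer", ".tmp", in.getAbsoluteFile.getParentFile)
    val source = Source.fromFile(in)
    val writer = new BufferedWriter(new FileWriter(tmp))
    try {
      // Only one line at a time is held in memory.
      for (line <- source.getLines()) {
        writer.write(stripBrackets(line))
        writer.newLine()
      }
    } finally {
      writer.close()
      source.close()
    }
    // Replace the original only after everything was written.
    Files.move(tmp.toPath, in.toPath, StandardCopyOption.REPLACE_EXISTING)
  }
}
```

Note that this writes the platform line separator after every line, so line endings may be normalized and a trailing newline added compared to the original file. In main, the loop body would then reduce to a call like LineRegexer.processFile(fn).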
Upvotes: 9
Reputation: 14825
Don't try to load the whole file into memory:
val datafile:String = Source.fromFile(fn).mkString // this is most likely the culprit
Also, try increasing the heap size of the JVM in case processing line by line is not possible.
Upvotes: 2