Reputation: 1678
I'm playing around with Scala's lazy iterators, and I've run into an issue. What I'm trying to do is read in a large file, do a transformation, and then write out the result:
import scala.io.Source
import java.io.PrintWriter

object FileProcessor {
  def main(args: Array[String]) {
    val inSource = Source.fromFile("in.txt")
    val outSource = new PrintWriter("out.txt")
    try {
      // this "basic" lazy iterator works fine
      // val iterator = inSource.getLines

      // ...but this one, which incorporates my process method,
      // throws an OutOfMemoryError
      val iterator = process(inSource.getLines.toSeq).iterator
      while (iterator.hasNext) outSource.println(iterator.next)
    } finally {
      inSource.close()
      outSource.close()
    }
  }

  // processing in this case just means upper-casing every line
  private def process(contents: Seq[String]) = contents.map(_.toUpperCase)
}
So I'm getting an OutOfMemoryError on large files. I know you can run afoul of Scala's lazy Streams if you keep around a reference to the head of the Stream, so in this case I'm careful to convert the result of process() to an iterator and throw away the Seq it initially returns.
Does anyone know why this still causes O(n) memory consumption? Thanks!
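To illustrate what I mean by head retention, here is a toy sketch (not my actual code, just the pattern I'm trying to avoid):

```scala
// Toy illustration of Stream head retention (Scala 2.x Stream):
// as long as `s` (the head) is reachable, every element that has
// been forced stays memoized in memory.
val s = Stream.from(1)
println(s.take(5).toList)  // List(1, 2, 3, 4, 5)
// s.drop(100000000)       // forcing this while `s` is live would
//                         // retain all those elements and eventually OOM
```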
In response to fge and huynhjl, it seems like the Seq might be the culprit, but I don't know why. As an example, the following code works fine even though it uses Seq all over the place, and it does not produce an OutOfMemoryError:
import scala.io.Source
import java.io.PrintWriter

object FileReader {
  def main(args: Array[String]) {
    val inSource = Source.fromFile("in.txt")
    val outSource = new PrintWriter("out.txt")
    try {
      writeToFile(outSource, process(inSource.getLines.toSeq))
    } finally {
      inSource.close()
      outSource.close()
    }
  }

  @scala.annotation.tailrec
  private def writeToFile(outSource: PrintWriter, contents: Seq[String]) {
    if (!contents.isEmpty) {
      outSource.println(contents.head)
      writeToFile(outSource, contents.tail)
    }
  }

  private def process(contents: Seq[String]) = contents.map(_.toUpperCase)
}
Upvotes: 4
Views: 1137
Reputation: 41646
As hinted by fge, modify process to take an iterator and remove the .toSeq: inSource.getLines is already an Iterator[String]. Converting it to a Seq causes the items to be remembered; toSeq on an iterator builds a lazy Stream, and a Stream memoizes every element that has been forced.
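A sketch of that fix, keeping the original structure but making process consume and return an Iterator (my guess at the minimal change):

```scala
import scala.io.Source
import java.io.PrintWriter

object FileProcessor {
  def main(args: Array[String]) {
    val inSource = Source.fromFile("in.txt")
    val outSource = new PrintWriter("out.txt")
    try {
      // process now consumes and returns an Iterator, so no
      // intermediate Seq/Stream ever holds the whole file
      val iterator = process(inSource.getLines)
      while (iterator.hasNext) outSource.println(iterator.next)
    } finally {
      inSource.close()
      outSource.close()
    }
  }

  private def process(contents: Iterator[String]) = contents.map(_.toUpperCase)
}
```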
Edit: OK, it's more subtle. By calling iterator on the result of process, you are doing the equivalent of Iterator.toSeq.iterator. This can cause an out-of-memory error:
scala> Iterator.continually(1).toSeq.iterator.take(300*1024*1024).size
java.lang.OutOfMemoryError: Java heap space
It may be the same issue as reported here: https://issues.scala-lang.org/browse/SI-4835. Note my comment at the end of the bug report; this is from personal experience.
Upvotes: 6