shj

Reputation: 1678

Scala Infinite Iterator OutOfMemory

I'm playing around with Scala's lazy iterators, and I've run into an issue. What I'm trying to do is read in a large file, do a transformation, and then write out the result:

import java.io.PrintWriter
import scala.io.Source

object FileProcessor {
  def main(args: Array[String]) {
    val inSource = Source.fromFile("in.txt")
    val outSource = new PrintWriter("out.txt")

    try {
      // this "basic" lazy iterator works fine
      // val iterator = inSource.getLines

      // ...but this one, which incorporates my process method, 
      // throws OutOfMemoryExceptions
      val iterator = process(inSource.getLines.toSeq).iterator

      while(iterator.hasNext) outSource.println(iterator.next)

    } finally {
      inSource.close()
      outSource.close()
    }
  }

  // processing in this case just means upper-cases every line
  private def process(contents: Seq[String]) = contents.map(_.toUpperCase)
}

So I'm getting an OutOfMemoryException on large files. I know you can run afoul of Scala's lazy Streams if you keep references to the head of the Stream around, so in this case I'm careful to convert the result of process() to an iterator and throw away the Seq it initially returns.
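For reference, here's a rough sketch (not my actual code) of the Stream head-retention pitfall I'm referring to:

// Holding a reference to the head of a Stream keeps every evaluated
// cell reachable, so memory grows with the number of elements forced.
val s = Stream.from(1)                 // lazy, effectively unbounded
val head = s                           // keeping this reference around...
head.take(10000000).foreach(_ => ())   // ...retains all forced cells in memory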

Does anyone know why this still causes O(n) memory consumption? Thanks!


Update

In response to fge and huynhjl, it seems like the Seq might be the culprit, but I don't know why. As an example, the following code works fine even though I'm using Seq all over the place, and it does not produce an OutOfMemoryException:

import java.io.PrintWriter
import scala.io.Source

object FileReader {
  def main(args: Array[String]) {
    val inSource = Source.fromFile("in.txt")
    val outSource = new PrintWriter("out.txt")
    try {
      writeToFile(outSource, process(inSource.getLines.toSeq))
    } finally {
      inSource.close()
      outSource.close()
    }
  }

  @scala.annotation.tailrec
  private def writeToFile(outSource: PrintWriter, contents: Seq[String]) {
    if (!contents.isEmpty) {
      outSource.println(contents.head)
      writeToFile(outSource, contents.tail)
    }
  }

  private def process(contents: Seq[String]) = contents.map(_.toUpperCase)
}

Upvotes: 4

Views: 1137

Answers (1)

huynhjl

Reputation: 41646

As hinted by fge, modify process to take an iterator and remove the .toSeq. inSource.getLines is already an iterator.
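For example, something along these lines (just a sketch of the change, not your exact code):

// process works directly on the Iterator, so lines are transformed
// one at a time and nothing is accumulated in memory.
private def process(contents: Iterator[String]): Iterator[String] =
  contents.map(_.toUpperCase)

// and in main:
// val iterator = process(inSource.getLines)
// while (iterator.hasNext) outSource.println(iterator.next)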

Converting to a Seq causes the items to be remembered: I think it wraps the iterator in a Stream, which memoizes every element it produces.
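You can see this with a quick REPL check (this is what a 2.x REPL shows; the exact rendering may vary by version):

scala> Iterator.continually(1).toSeq
res0: Seq[Int] = Stream(1, ?)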

Edit: OK, it's more subtle. By calling iterator on the result of process, you are doing the equivalent of Iterator.toSeq.iterator, and that alone can cause an out-of-memory error:

scala> Iterator.continually(1).toSeq.iterator.take(300*1024*1024).size
java.lang.OutOfMemoryError: Java heap space

It may be the same issue as reported here: https://issues.scala-lang.org/browse/SI-4835. Note my comment at the end of the bug; this is from personal experience.

Upvotes: 6
