Reputation: 6423
I am surprised that this throws an out-of-memory error, considering that the operations are on top of a scala.collection.Iterator. The individual lines are small (< 1KB):
Source.fromFile("largefile.txt").getLines.map(_.size).max
It appears to be trying to load the entire file into memory, though I am not sure which step triggers this. This is disappointing behavior for such a basic operation. Is there a simple way around it? And is there any reason for this design by the library implementers?
I tried the same in Java 8:
Files.lines(Paths.get("largefile.txt")).map(it -> it.length()).max(Integer::compare).get()
//result: 3131
And this works predictably: Files.lines returns a java.util.stream.Stream, and the heap does not explode.
Update: it looks like it boils down to newline interpretation. Both files are being interpreted as UTF-8, and down the line both snippets call java.io.BufferedReader.readLine(), so I still need to figure out where the discrepancy is. I also compiled both snippets' Main classes into the same project jar.
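To narrow it down, one rough diagnostic I'm considering is to simply compare how many lines each API produces for the same file (a sketch only, not yet verified against the actual file; it reads largefile.txt twice). If the counts differ, the two readers are splitting the file on different boundaries:

import scala.io.Source
import java.nio.file.{Files, Paths}

// Count lines as scala.io.Source sees them (iterator, consumed lazily).
val scalaCount = Source.fromFile("largefile.txt").getLines().size

// Count lines as java.nio.file.Files.lines sees them (java.util.stream.Stream).
val javaStream = Files.lines(Paths.get("largefile.txt"))
val javaCount  = javaStream.count()
javaStream.close()

println(s"Source.getLines: $scalaCount lines, Files.lines: $javaCount lines")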
Upvotes: 3
Views: 549
Reputation: 1297
I'm willing to bet the issue is that you're counting 'lines' differently than getLines is. From the API:
(getLines) Returns an iterator who returns lines (NOT including newline character(s)). It will treat any of \r\n, \r, or \n as a line separator (longest match) - if you need more refined behavior you can subclass Source#LineIterator directly.
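As a small illustration of that quoted behavior (using a made-up input string, not your file), a lone \r is already enough for getLines() to start a new line:

import scala.io.Source

// Hypothetical input: a stray \r, a \r\n, and a trailing \n.
val lines = Source.fromString("first\rsecond\r\nthird\n").getLines().toList
println(lines)  // List(first, second, third) -- three lines, not two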
Try executing this against the file in question:
Source.fromFile("testfile.txt").getLines().
zipWithIndex.map{ case(s, i) => (s.length, i)}.
foreach(e=> if (e._1 > 1000) println(
"line: " + e._2 + " is: " + e._1 + " bytes!"))
This will tell you how many lines in the file are larger than 1K, and the index of each offending line.
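If you only care about the single worst offender, a minimal variation of the same pipeline (still assuming the testfile.txt path above) reports just the longest line and its index:

import scala.io.Source

// Same splitting rules as getLines(); maxBy consumes the iterator lazily,
// keeping only the current line and the running maximum in memory.
val (maxLen, maxIdx) = Source.fromFile("testfile.txt").getLines().
  zipWithIndex.map { case (s, i) => (s.length, i) }.
  maxBy(_._1)

println("longest line is #" + maxIdx + " at " + maxLen + " chars")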
Upvotes: 3