smartnut007

Reputation: 6423

Surprising Scala Iterator "out of memory" error

I am surprised that this throws an out-of-memory error, considering that the operations are on top of a scala.collection.Iterator. The individual lines are small (< 1 KB):

import scala.io.Source

Source.fromFile("largefile.txt").getLines.map(_.size).max

It appears to be trying to load the entire file into memory, though I am not sure which step triggers this. That is disappointing behavior for such a basic operation. Is there a simple way around it? And is there any reason the library implementors designed it this way?
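For reference, one workaround sketch that seems to stay in constant memory, assuming the blow-up is specific to Source's line handling: drop down to a plain java.io.BufferedReader, which streams one line at a time. (Same placeholder file name as above.)

import java.io.{BufferedReader, FileReader}

// Sketch: stream lines via BufferedReader instead of scala.io.Source,
// keeping only the running maximum in memory.
val reader = new BufferedReader(new FileReader("largefile.txt"))
try {
  var max = 0
  var line = reader.readLine()
  while (line != null) {
    max = math.max(max, line.length)
    line = reader.readLine()
  }
  println(max)
} finally reader.close()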

I tried the same in Java 8:

import java.nio.file.*;

// note: the comparator must be Integer::compare; Integer::max also compiles
// but does not order the stream correctly
Files.lines(Paths.get("largefile.txt")).map(it -> it.length()).max(Integer::compare).get();
// result: 3131

And this works predictably: Files.lines returns a java.util.stream.Stream, and the heap does not explode.

Update: it looks like it boils down to newline interpretation. The file is read as UTF-8 in both cases, and down the line both snippets call java.io.BufferedReader.readLine(). So I still need to figure out where the discrepancy is. I compiled the Main classes of both snippets into the same project jar.
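A quick diagnostic sketch (my own, not from either library) that counts raw \r and \n bytes, to see which separators the file actually contains independent of any line-splitting code. Counting raw bytes is safe for UTF-8, since these byte values never occur inside multi-byte sequences:

import java.io.{BufferedInputStream, FileInputStream}

// Count carriage returns and line feeds byte by byte.
val in = new BufferedInputStream(new FileInputStream("largefile.txt"))
try {
  var cr = 0L; var lf = 0L
  var b = in.read()
  while (b != -1) {
    if (b == '\r') cr += 1
    if (b == '\n') lf += 1
    b = in.read()
  }
  println(s"CR: $cr  LF: $lf")
} finally in.close()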

Upvotes: 3

Views: 549

Answers (1)

snerd

Reputation: 1297

I'm willing to bet the issue is that you're counting 'lines' differently than getLines is. From the API docs:

(getLines) Returns an iterator who returns lines (NOT including newline character(s)). It will treat any of \r\n, \r, or \n as a line separator (longest match) - if you need more refined behavior you can subclass Source#LineIterator directly.
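A small self-contained demonstration of that rule, using Source.fromString as a stand-in for a file:

import scala.io.Source

// getLines treats \r\n, \r, and \n all as separators (longest match),
// so a lone \r still ends a line. A file containing none of these comes
// back as one giant "line".
val mixed = "one\r\ntwo\rthree\nfour"
Source.fromString(mixed).getLines().foreach(l => println(s"[$l]"))
// prints [one], [two], [three], [four]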

Try executing this against the file in question:

  Source.fromFile("testfile.txt").getLines().
    zipWithIndex.map{ case(s, i) => (s.length, i)}.
      foreach(e=> if (e._1 > 1000) println(
        "line: " + e._2 + " is: " + e._1 + " bytes!"))

This will print the index and length of every line longer than 1000 characters, so you can see which line(s) blow past your expected < 1 KB bound.
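And if you just want the single worst offender, the same streaming pipeline can report the longest line and its index in one pass:

import scala.io.Source

// maxBy consumes the iterator lazily; only the current best tuple is retained.
val (len, idx) = Source.fromFile("testfile.txt").getLines()
  .zipWithIndex
  .map { case (s, i) => (s.length, i) }
  .maxBy(_._1)
println(s"longest line is #$idx at $len chars")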

Upvotes: 3
