Reputation: 6423
I am surprised that this throws an out-of-memory error, considering that the operations are on top of a scala.collection.Iterator. The individual lines are small (< 1KB):
Source.fromFile("largefile.txt").getLines.map(_.size).max
It appears to be trying to load the entire file into memory, though I am not sure which step triggers this. This is disappointing behavior for such a basic operation. Is there a simple way around it? And is there any reason for this design by the library implementers?
I tried the same in Java 8:
Files.lines(Paths.get("largefile.txt")).map(it -> it.length()).max(Integer::compare).get()
//result: 3131
And this works predictably: Files.lines returns a java.util.stream.Stream, and the heap does not explode.
Update: it looks like it boils down to newline interpretation. Both files are being interpreted as UTF-8, and down the line both snippets call java.io.BufferedReader.readLine(), so I still need to figure out where the discrepancy is. I also compiled both snippets' Main classes into the same project jar.
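To narrow it down, one rough diagnostic I'm considering is to simply compare how many lines each API produces for the same file (a sketch only, not yet verified against the actual file; it reads largefile.txt twice). If the counts differ, the two readers are splitting the file on different boundaries:

import scala.io.Source
import java.nio.file.{Files, Paths}

// Count lines as scala.io.Source sees them (iterator, consumed lazily).
val scalaCount = Source.fromFile("largefile.txt").getLines().size

// Count lines as java.nio.file.Files.lines sees them (java.util.stream.Stream).
val javaStream = Files.lines(Paths.get("largefile.txt"))
val javaCount  = javaStream.count()
javaStream.close()

println(s"Source.getLines: $scalaCount lines, Files.lines: $javaCount lines")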
Upvotes: 3
Views: 549
Reputation: 1297
I'm willing to bet the issue is that you're counting 'lines' differently than getLines is. From the API:
(getLines) Returns an iterator who returns lines (NOT including newline character(s)). It will treat any of \r\n, \r, or \n as a line separator (longest match) - if you need more refined behavior you can subclass Source#LineIterator directly.
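As a small illustration of that quoted behavior (using a made-up input string, not your file), a lone \r is already enough for getLines() to start a new line:

import scala.io.Source

// Hypothetical input: a stray \r, a \r\n, and a trailing \n.
val lines = Source.fromString("first\rsecond\r\nthird\n").getLines().toList
println(lines)  // List(first, second, third) -- three lines, not two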
Try executing this against the file in question:
Source.fromFile("testfile.txt").getLines().
zipWithIndex.map{ case(s, i) => (s.length, i)}.
foreach(e=> if (e._1 > 1000) println(
"line: " + e._2 + " is: " + e._1 + " bytes!"))
This will tell you how many lines in the file are larger than 1K, and the index of each offending line.
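If you only care about the single worst offender, a minimal variation of the same pipeline (still assuming the testfile.txt path above) reports just the longest line and its index:

import scala.io.Source

// Same splitting rules as getLines(); maxBy consumes the iterator lazily,
// keeping only the current line and the running maximum in memory.
val (maxLen, maxIdx) = Source.fromFile("testfile.txt").getLines().
  zipWithIndex.map { case (s, i) => (s.length, i) }.
  maxBy(_._1)

println("longest line is #" + maxIdx + " at " + maxLen + " chars")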
Upvotes: 3