Reputation: 1695
I'm new to Scala and figuring things out on the fly. I have a program that needs to read gzip files of various sizes: 20KB, 2MB and 150MB (yes, the zipped file is 150MB). I would rather not have a different approach for each file size, but a standard one throughout. Most of the approaches I see use a buffer size of 64MB to read files line by line. What is the best way (read as: the *fastest and most memory-clean* way) to do this?
Thanks in advance for the help!
Update 1:
Great improvements in the reading rate. (I would even share my karma points.) Thanks SO! :)
But I noticed that, since each of my files has around 10K lines, it takes a long time to convert the String Iterator to a single string before writing it to a file. I can take two approaches,
I'm assuming [2] would be faster. So, this is what I am doing for writing:
val processedLines = linesFromGzip(new File(fileName)).map(line => MyFunction(line))
val outFile = Resource.fromFile(outFileName)
outFile.write(processedLines.mkString("\n")) // severe overhead -> processedLines.mkString("\n")
Also, my analysis (commenting out the write() call) shows that the write itself doesn't take much time; the cost is in converting processedLines to a single big String. That takes close to a second, which is a huge cost for my application. What would be the best way (again: clean, without any memory leaks) to do this?
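One possible way to avoid building the single big String is to write each line as the iterator produces it. The following is a minimal sketch only, assuming plain java.io writers rather than the Resource API used above; writeLines is a hypothetical helper name, not part of any library.

```scala
import java.io.{BufferedWriter, File, FileWriter}

// Hypothetical helper: writes each line as soon as it is produced, so the
// whole output never has to exist in memory as one String.
def writeLines(lines: Iterator[String], outFile: File): Unit = {
  val writer = new BufferedWriter(new FileWriter(outFile))
  try {
    lines.foreach { line =>
      writer.write(line)
      writer.write("\n") // one separator after every line, including the last
    }
  } finally {
    writer.close() // runs even if producing or writing a line throws
  }
}
```

Unlike mkString("\n"), this sketch leaves a trailing newline after the last line. It could be called as writeLines(processedLines, new File(outFileName)).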
Upvotes: 0
Views: 2750
Reputation: 3792
Your memory problem is caused by having too many open files, not by the size of the files. You need a mechanism to automagically close each file after reading it.
One way of doing it:
import scala.io.Source

// this Source closes at the end of iteration
implicit def closingSource(source: Source) = new {
  val lines = source.getLines()
  var isOpen = true
  def closeAfterGetLines() = new Iterator[String] {
    def hasNext = isOpen && hasNextAndCloseIfDone
    def next() = {
      val line = lines.next()
      hasNextAndCloseIfDone // closes the source right after the last line is read
      line
    }
    private def hasNextAndCloseIfDone = if (lines.hasNext) true else { source.close() ; isOpen = false ; false }
  }
}
and then you use a gzip reader:
import java.io.{BufferedInputStream, File, FileInputStream}
import java.util.zip.GZIPInputStream

def gzInputStream(gzipFile: File) = new GZIPInputStream(new BufferedInputStream(new FileInputStream(gzipFile)))
def linesFromGzip(gzipFile: File): Iterator[String] = {
  Source.fromInputStream(gzInputStream(gzipFile)).closeAfterGetLines()
}
Note that the files are closed only if the iteration is completed, i.e. the entire file is read. If (for some reason) you do not read the entire file, you need to manually close the file.
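For the partial-read case, a minimal sketch of closing the file manually could look like this. firstLines is a hypothetical helper introduced only for illustration; it opens the gzip file directly and does not rely on the implicit conversion.

```scala
import java.io.{BufferedInputStream, File, FileInputStream}
import java.util.zip.GZIPInputStream
import scala.io.Source

// Hypothetical helper: reads at most n lines and always closes the source,
// whether or not the end of the file was reached.
def firstLines(gzipFile: File, n: Int): List[String] = {
  val source = Source.fromInputStream(
    new GZIPInputStream(new BufferedInputStream(new FileInputStream(gzipFile))))
  try {
    source.getLines().take(n).toList // force evaluation before the source is closed
  } finally {
    source.close() // also closes the underlying input stream
  }
}
```

The toList call matters here: returning the lazy iterator would defer the reads until after the source had already been closed.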
Upvotes: 3