Learner

Reputation: 1695

Gzip file reader Scala

I'm new to Scala and figuring things out on the fly. I have a program that needs to read Gzip files of various sizes: 20KB, 2MB and 150MB (yes, the zipped file is 150MB). I would rather not have a different approach for each file size, but a standard one throughout. Most of the approaches I see use a buffer size of 64MB to read files line by line. What is the best (read as: *fastest and memory-clean*) way to do this?

Thanks in advance for the help!

update 1:

Great improvements in reading rate. (I would even share my karma points.) Thanks SO! :)

But I noticed that, since each of my files has around 10K lines, it takes a long time to convert the String Iterator to a single string before writing it to a file. I can take one of two approaches:

  1. Iterate line by line and write line by line to the file.
  2. Iterate line by line, join the lines into one big string ("\n" delimited), and write that big string to the file.

I'm assuming [2] would be faster. So this is what I am doing for writing:

var processedLines = linesFromGzip(new File(fileName)).map(line => MyFunction(line))

var  outFile = Resource.fromFile(outFileName)

outFile.write(processedLines.mkString("\n"))  // severe overhead -> processedLines.mkString("\n")

Also, my analysis (commenting out the write()) shows that the write itself doesn't take much time; converting processedLines to a single big String does. It takes close to a second, which is a huge cost for my application. What would be the best (again, clean and without any memory leaks) way to do this?
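For comparison, approach [1] could be sketched roughly as below, using only plain java.io. This is a minimal sketch, not a measured solution: `writeLines` is a hypothetical helper name, and whether it actually beats `mkString` would need to be timed on real data.

    import java.io.{BufferedWriter, File, FileWriter}

    // Hypothetical helper: writes each line as soon as the iterator produces it,
    // so the full "\n"-joined string is never built in memory.
    // Unlike mkString("\n"), this also emits a trailing "\n" after the last line.
    def writeLines(lines: Iterator[String], outFile: File): Unit = {
      val writer = new BufferedWriter(new FileWriter(outFile))
      try {
        lines.foreach { line =>
          writer.write(line)
          writer.write("\n")
        }
      } finally writer.close() // flushes the buffer and releases the file handle
    }

It would be called as `writeLines(processedLines, new File(outFileName))`, with the BufferedWriter batching the small writes.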

Upvotes: 0

Views: 2750

Answers (1)

Adrian

Reputation: 3792

Your memory problem is caused by having too many open files, not by the size of the files. You need a mechanism to automagically close each file after reading it.

One way of doing it:

      import scala.io.Source

      // this Source closes at the end of iteration
      implicit def closingSource(source: Source) = new {
        val lines = source.getLines()
        var isOpen = true
        def closeAfterGetLines() = new Iterator[String] {
          def hasNext = isOpen && hasNextAndCloseIfDone
          def next() = {
            val line = lines.next()
            hasNextAndCloseIfDone
            line
          }
          private def hasNextAndCloseIfDone = if (lines.hasNext) true else { source.close() ; isOpen = false ; false }
        }
      }

and then you use a gzip reader:

import java.io.{BufferedInputStream, File, FileInputStream}
import java.util.zip.GZIPInputStream

def gzInputStream(gzipFile: File) = new GZIPInputStream(new BufferedInputStream(new FileInputStream(gzipFile)))

def linesFromGzip(gzipFile: File): Iterator[String] = {
  Source.fromInputStream(gzInputStream(gzipFile)).closeAfterGetLines()
}

Note that the files are closed only if the iteration is completed, i.e. the entire file is read. If (for some reason) you do not read the entire file, you need to manually close the file.
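If you might stop reading early, one alternative is to close the source in a `finally` block instead of relying on the iterator reaching its end. The following is only a sketch of that idea: `withGzipLines` is a hypothetical helper name, not part of the code above.

    import java.io.{BufferedInputStream, File, FileInputStream}
    import java.util.zip.GZIPInputStream
    import scala.io.Source

    // Hypothetical helper: opens the gzip file, hands the line iterator to `f`,
    // and closes the source afterwards even if `f` stops early or throws.
    def withGzipLines[A](gzipFile: File)(f: Iterator[String] => A): A = {
      val source = Source.fromInputStream(
        new GZIPInputStream(new BufferedInputStream(new FileInputStream(gzipFile))))
      try f(source.getLines())
      finally source.close() // also closes the underlying input streams
    }

For example, `withGzipLines(file)(_.take(2).toList)` reads only the first two lines and still closes the file. The function passed in has to consume what it needs before returning (hence the `toList`), because the iterator is no longer usable once the source is closed.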

Upvotes: 3
