Reputation: 1334
I am using Scala. I need to read a large gzip file, turn it into a string, and remove the first line. This is how I read the file:
val fis = new FileInputStream(filename)
val gz = new GZIPInputStream(fis)
Then I tried Source.fromInputStream(gz).getLines.drop(1).mkString(""), but it causes an out-of-memory error.
So I am thinking of reading the file line by line, perhaps into a byte array, and converting it into a single String at the end.
But I have no idea how to do this. Any suggestions? A better method is also welcome.
Upvotes: 0
Views: 1336
Reputation: 2423
If your gzipped file is huge, you can use a BufferedReader. Here is an example: it copies all chars from the gzipped file to an uncompressed file, skipping the first line.
import java.util.zip.GZIPInputStream
import java.io._
import java.nio.charset.StandardCharsets
import scala.annotation.tailrec
import scala.util.Try
val bufferSize = 4096
val pathToGzFile = "/tmp/text.txt.gz"
val pathToOutputFile = "/tmp/text_without_first_line.txt"
val charset = StandardCharsets.UTF_8
val inStream = new FileInputStream(pathToGzFile)
val outStream = new FileOutputStream(pathToOutputFile)
try {
  val inGzipStream = new GZIPInputStream(inStream)
  val inReader = new InputStreamReader(inGzipStream, charset)
  val outWriter = new OutputStreamWriter(outStream, charset)
  val bufferedReader = new BufferedReader(inReader)
  val closeables = Array[Closeable](inGzipStream, inReader,
    outWriter, bufferedReader)

  // Read the first line here so that copy never sees it; this is how it is skipped
  val firstLine = bufferedReader.readLine()
  println(s"First line: $firstLine")

  @tailrec
  def copy(in: Reader, out: Writer, buffer: Array[Char]): Unit = {
    // read returns -1 at end of file, so keep copying while chars were read
    val readChars = in.read(buffer, 0, buffer.length)
    if (readChars > 0) {
      out.write(buffer, 0, readChars)
      copy(in, out, buffer)
    }
  }

  // Copy chars from bufferedReader to outWriter through a fixed-size buffer
  copy(bufferedReader, outWriter, Array.ofDim[Char](bufferSize))

  // Close all closeables (closing outWriter also flushes it)
  closeables.foreach(c => Try(c.close()))
}
finally {
  // Make sure the underlying file streams are released even if something failed
  Try(inStream.close())
  Try(outStream.close())
}
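If you need to process the content line by line rather than copy it to a file, the attempt from the question can probably be kept almost as is. getLines is lazy, so the out-of-memory error most likely comes from mkString building one giant String, not from reading. Below is a minimal sketch under that assumption; foreachLineAfterFirst is a hypothetical helper name, and UTF-8 is assumed as the encoding.

```scala
import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.{Codec, Source}

// Hypothetical helper: passes every line except the first to `handle`.
// Lines are decompressed and decoded lazily, so the whole file is never
// held in memory at once.
def foreachLineAfterFirst(gzPath: String)(handle: String => Unit): Unit = {
  // Assumption: the file is UTF-8 text
  val source = Source.fromInputStream(
    new GZIPInputStream(new FileInputStream(gzPath)))(Codec.UTF8)
  try source.getLines().drop(1).foreach(handle)
  finally source.close() // also closes the wrapped gzip and file streams
}
```

For example, foreachLineAfterFirst("/tmp/text.txt.gz")(line => println(line.length)) prints the length of every line after the first. If you really do need a single String, the uncompressed content has to fit in the heap whichever way you read it.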
Upvotes: 2