Dba710

Reputation: 25

How to improve java.util.zip.GZIPInputStream performance to unzip a large .gz file?

I'm trying to unzip a very large .gz file (around 50 MB) in Java and then transfer it to the Hadoop file system. After unzipping, the file expands to 20 GB. The whole job takes more than 5 minutes.

protected void write(BufferedInputStream bis, Path outputPath, FileSystem hdfs) throws IOException
{
    // try-with-resources closes the stream, flushing buffered bytes to HDFS
    try (BufferedOutputStream bos = new BufferedOutputStream(hdfs.create(outputPath))) {
        IOUtils.copyBytes(bis, bos, 8 * 1024);
    }
}

Even with buffered I/O streams, decompressing and transferring the file takes a very long time.

Is Hadoop making the transfer slow, or is GZIPInputStream the bottleneck?
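For reference (the question doesn't show it), the input side is presumably built something like the sketch below; the file name and buffer sizes are assumptions. GZIPInputStream's second constructor argument raises its internal inflater buffer from the 512-byte default, which is one cheap tuning knob:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical setup; "input.gz" is a placeholder file name.
// The second argument enlarges GZIPInputStream's internal buffer from its
// 512-byte default, reducing the number of native inflate calls.
BufferedInputStream bis = new BufferedInputStream(
        new GZIPInputStream(new FileInputStream("input.gz"), 64 * 1024),
        64 * 1024);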

Upvotes: 2

Views: 726

Answers (1)

Writing 20 GB will take time. If you do it in 300 seconds, you are still writing about 70 MB a second (20 GB / 300 s ≈ 68 MB/s).

You may simply be hitting the limits of the platform.
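To see whether decompression or the HDFS write dominates, one rough check (a sketch, not the poster's code; the local file name is a placeholder) is to time decompression alone and discard the output:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class GunzipTimer {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long total = 0;
        try (InputStream in = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream("input.gz")), 64 * 1024)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;  // discard the data; we only measure decompression speed
            }
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes in %.1f s (%.1f MB/s)%n", total, secs, total / secs / 1e6);
    }
}

If this alone approaches 300 seconds, GZIPInputStream is the bottleneck; if it is much faster, the time is going into the HDFS write.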

If you can rewrite your processing code to read the compressed file directly, that may help.
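A minimal sketch of that approach, assuming the .gz is uploaded to HDFS as-is and decompressed on the fly when read (the path is a placeholder):

import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

protected void readCompressed(FileSystem hdfs, Configuration conf) throws IOException {
    Path gzPath = new Path("/data/input.gz");  // hypothetical HDFS location
    // CompressionCodecFactory resolves GzipCodec from the .gz extension
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(gzPath);
    try (InputStream in = codec.createInputStream(hdfs.open(gzPath))) {
        // consume decompressed bytes here instead of materializing 20 GB first
    }
}

This way you only move ~50 MB over the network and never pay for a 20 GB write.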

Upvotes: 1
