smellyarmpits

Reputation: 1120

Reading zip file efficiently in Java

I am working on a project that handles a very large amount of data. I have a lot (thousands) of zip files, each containing ONE simple txt file with thousands of lines (about 80k lines). What I am currently doing is the following:

for (File zipFile : dir.listFiles()) {
    ZipFile zf = new ZipFile(zipFile);
    ZipEntry ze = (ZipEntry) zf.entries().nextElement();
    BufferedReader in = new BufferedReader(new InputStreamReader(zf.getInputStream(ze)));
    ...

In this way I can read the file line by line, but it is definitely too slow. Given the large number of files and lines that need to be read, I need to read them in a more efficient way.

I have looked for a different approach, but I haven't been able to find anything. What I think I should use are the java.nio APIs, which are intended precisely for intensive I/O operations, but I don't know how to use them with zip files.

Any help would really be appreciated.

Thanks,

Marco

Upvotes: 7

Views: 18813

Answers (6)

manuelvigarcia

Reputation: 2094

Asynchronous unpacking and synchronous processing

Following the advice from Java Performance, which is much like the answer from Wasim Wani and the one from Satheesh Kumar (iterate over the ZIP entries to get an InputStream for each of them, and then do something with them), I built my own solution.

In my case, the processing is the bottleneck, so I launch the extraction massively in parallel at the start, iterating over entries.hasMoreElements(), and place each result in a ConcurrentLinkedQueue that I consume from the processing thread. My ZIP contains a collection of XML files representing serialized Java objects, so my "extracting" includes deserializing the objects, and it is those deserialized objects that are placed in the queue.

For me, this has a few advantages compared to my previous approach of sequentially getting each file from the ZIP and processing it:

  1. the most compelling one: a 10% reduction in total time
  2. the file is released earlier
  3. the whole amount of RAM is allocated more quickly, so if there is not enough RAM it fails faster (in a matter of tens of minutes instead of over an hour); note that the amount of memory I keep allocated after processing is quite similar to that occupied by the unzipped files; otherwise, it would be better to unzip and discard sequentially to keep the memory footprint lower
  4. unzipping and deserializing have high CPU usage, so the sooner they finish, the sooner the CPU is free for the processing, which is what really matters

There is one disadvantage: the flow control gets a little more complex when parallelism is involved; a sketch of the pattern follows below.
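A minimal sketch of the pattern, with stated assumptions: the archive name is a placeholder, the queue holds raw byte arrays instead of the deserialized objects described above, and process() stands in for the real work.

import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ParallelUnzip {

    public static void main(String[] args) throws IOException {
        final ConcurrentLinkedQueue<byte[]> queue = new ConcurrentLinkedQueue<>();
        final ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        try (ZipFile zf = new ZipFile("objects.zip")) { // placeholder archive name
            // Producers: extract each entry in parallel and queue the result.
            final Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                final ZipEntry entry = entries.nextElement();
                pool.submit(() -> {
                    try (InputStream in = zf.getInputStream(entry)) {
                        queue.add(in.readAllBytes()); // stands in for unzip + deserialize
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();

            // Consumer: the processing thread drains the queue while producers run.
            while (!pool.isTerminated() || !queue.isEmpty()) {
                final byte[] item = queue.poll();
                if (item != null) {
                    process(item);
                }
            }
        }
    }

    private static void process(byte[] data) {
        // placeholder for the processing step that is the actual bottleneck
        System.out.println("processed " + data.length + " bytes");
    }
}

The busy-wait in the consumer is kept deliberately simple; a BlockingQueue with a poison-pill element would be the cleaner way to signal completion.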

Upvotes: 0

Wasim Wani

Reputation: 555

The right way to iterate a zip file

final ZipFile file = new ZipFile( FILE_NAME );
try
{
    final Enumeration<? extends ZipEntry> entries = file.entries();
    while ( entries.hasMoreElements() )
    {
        final ZipEntry entry = entries.nextElement();
        System.out.println( entry.getName() );
        // use the entry input stream:
        readInputStream( file.getInputStream( entry ) );
    }
}
finally
{
    file.close();
}

private static int readInputStream( final InputStream is ) throws IOException {
    try
    {
        final byte[] buf = new byte[ 8192 ];
        int read = 0;
        int cntRead;
        while ( ( cntRead = is.read( buf, 0, buf.length ) ) >= 0 )
        {
            read += cntRead;
        }
        return read;
    }
    finally
    {
        is.close(); // close each entry stream once it has been drained
    }
}

A zip file consists of several entries, each of which has a field containing the number of bytes in that entry. So it is easy to iterate over all zip file entries without actually decompressing the data. java.util.zip.ZipFile accepts a file/file name and uses random access to jump between file positions. java.util.zip.ZipInputStream, on the other hand, works with streams, so it cannot jump freely. That's why it has to read and decompress all zip data in order to reach the EOF of each entry and read the next entry header.

What does this mean? If you already have a zip file in your file system, use ZipFile to process it, regardless of your task. As a bonus, you can access zip entries either sequentially or randomly (with a rather small performance penalty). On the other hand, if you are processing a stream, you'll need to process all entries sequentially using ZipInputStream.

Here is an example: a zip archive (total file size = 1.6GB) containing three 0.6GB entries was iterated in 0.05 sec using ZipFile and in 18 sec using ZipInputStream.
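For contrast, a minimal sketch of the stream-based iteration (the archive name is a placeholder). As described above, the stream has to read through each entry's compressed data to find the next header, which is where the 18 seconds go:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipStreamIteration {

    public static void main(String[] args) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream("archive.zip"))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                System.out.println(entry.getName());
                zis.closeEntry(); // consumes the rest of the entry's data
            }
        }
    }
}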

Upvotes: 4

milan

Reputation: 2515

Intel has made an improved version of zlib, which Java uses internally to perform zip/unzip. It requires you to patch the zlib sources with Intel's IPP patches. I made a benchmark showing 1.4x to 3x gains in throughput.

Upvotes: 0

satheesh kumar

Reputation: 139

You can try this code:

try
{
    final ZipFile zf = new ZipFile("C:/Documents and Settings/satheesh/Desktop/POTL.Zip");
    final Enumeration<? extends ZipEntry> entries = zf.entries();

    while (entries.hasMoreElements())
    {
        final ZipEntry zipEntry = entries.nextElement();
        final String fileName = zipEntry.getName();
        System.out.println("Extracting: " + fileName);

        final InputStream inputs = zf.getInputStream(zipEntry);
        final BufferedReader br = new BufferedReader(new InputStreamReader(inputs, "UTF-8"));
        // f2 is the destination file the lines are copied to
        final BufferedWriter wr = new BufferedWriter(new FileWriter(f2));

        String line;
        while ((line = br.readLine()) != null)
        {
            wr.write(line);
            System.out.println(line);
            wr.newLine();
            wr.flush();
        }
        wr.close();
        br.close();
    }
    zf.close();
}
catch (Exception e)
{
    System.out.print(e);
}
finally
{
    System.out.println("\n\n\nThe file has been extracted successfully");
}

This code works well.

Upvotes: 0

Puce

Reputation: 38122

You can use the new file API like this:

Path jarPath = Paths.get(...);
// the (ClassLoader) cast avoids ambiguity with newFileSystem(Path, Map) on newer JDKs
try (FileSystem jarFS = FileSystems.newFileSystem(jarPath, (ClassLoader) null)) {
    Path someFileInJarPath = jarFS.getPath("/...");
    try (ReadableByteChannel rbc = Files.newByteChannel(someFileInJarPath, EnumSet.of(StandardOpenOption.READ))) {
        // read file
    }
}

The code is for jar files, but I think it should work for zips as well.
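Since the original question reads a text file line by line, the same zip file system can also hand you a BufferedReader directly. A sketch, with hypothetical archive and entry names:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ZipFsReadLines {

    public static void main(String[] args) throws IOException {
        Path zipPath = Paths.get("data.zip"); // hypothetical archive
        try (FileSystem zipFS = FileSystems.newFileSystem(zipPath, (ClassLoader) null)) {
            Path txtInZip = zipFS.getPath("/data.txt"); // hypothetical entry name
            try (BufferedReader reader = Files.newBufferedReader(txtInZip, StandardCharsets.UTF_8)) {
                reader.lines().forEach(System.out::println);
            }
        }
    }
}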

Upvotes: 1

NPE

Reputation: 500167

"I have a lot (thousands) of zip files. The zipped files are about 30MB each, while the txt inside the zip file is about 60/70 MB. Reading and processing the files with this code takes a lot of hours, around 15, but it depends."

Let's do some back-of-the-envelope calculations.

Let's say you have 5000 files. If it takes 15 hours to process them, this equates to ~10 seconds per file. The files are about 30MB each, so the throughput is ~3MB/s.

This is between one and two orders of magnitude slower than the rate at which ZipFile can decompress stuff.

Either there's a problem with the disks (are they local, or a network share?), or it is the actual processing that is taking most of the time.

The best way to find out for sure is by using a profiler.
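As a cruder first check, one can time the decompression alone by reading and discarding every byte, then compare the rate against the ~3MB/s of the full run. A sketch, assuming a directory of single-entry archives as in the question:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class UnzipThroughput {

    public static void main(String[] args) throws IOException {
        final File dir = new File(args[0]); // directory containing the zip files
        final long start = System.nanoTime();
        long bytes = 0;
        for (File zipFile : dir.listFiles()) {
            try (ZipFile zf = new ZipFile(zipFile)) {
                final ZipEntry ze = zf.entries().nextElement();
                try (InputStream in = zf.getInputStream(ze)) {
                    final byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) >= 0) {
                        bytes += n;
                    }
                }
            }
        }
        final double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Decompressed %.1f MB in %.1f s (%.1f MB/s)%n",
                bytes / 1e6, seconds, bytes / 1e6 / seconds);
    }
}

If this harness runs at close to disk speed, the bottleneck is the processing, not the unzipping.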

Upvotes: 3
