Reputation: 1120
I'm working on a project that handles a very large amount of data. I have a lot (thousands) of zip files, each containing ONE simple txt file with thousands of lines (about 80k lines). What I am currently doing is the following:
for (File zipFile : dir.listFiles()) {
    ZipFile zf = new ZipFile(zipFile);
    ZipEntry ze = zf.entries().nextElement();
    BufferedReader in = new BufferedReader(new InputStreamReader(zf.getInputStream(ze)));
    ...
In this way I can read the file line by line, but it is definitely too slow. Given the large number of files and lines that need to be read, I need to read them in a more efficient way.
I have looked for a different approach, but I haven't been able to find anything. What I think I should use are the Java NIO APIs, which are intended precisely for intensive I/O operations, but I don't know how to use them with zip files.
Any help would really be appreciated.
Thanks,
Marco
Upvotes: 7
Views: 18813
Reputation: 2094
Using the advice from Java Performance, which is much like the answer from Wasim Wani and that from Satheesh Kumar (iterating over the ZIP entries to get the InputStream of each of them and then doing something with them), I built my own solution.
In my case, the processing is the bottleneck, so at the start I launch the extraction massively in parallel, iterating with entries.hasMoreElements(), and place each result in a ConcurrentLinkedQueue that I consume from the processing thread (see the sketch below). My ZIP contains a collection of XML files representing serialized Java objects, so my "extracting" includes deserializing the objects, and those deserialized objects are the ones placed in the queue.
For me, this had a few advantages compared to my previous approach of sequentially getting each file from the ZIP and processing it.
There is one disadvantage: the flow control becomes a little more complex when parallelism is involved.
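A minimal sketch of that producer/consumer arrangement, assuming a single archive named archive.zip and substituting a placeholder extract/process pair for the deserialization and processing described above (every name here is hypothetical, and readAllBytes requires Java 9+):

import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ParallelZipReader {

    public static void main(String[] args) throws Exception {
        final ZipFile zip = new ZipFile("archive.zip");   // hypothetical archive
        final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();

        // Producer side: one extraction task per entry, launched in parallel.
        // ZipFile supports concurrent reads from multiple threads.
        final ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        final Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
            final ZipEntry entry = entries.nextElement();
            pool.submit(() -> {
                try (InputStream in = zip.getInputStream(entry)) {
                    queue.add(extract(in));   // stands in for deserialization
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();

        // Consumer side: drain results until all producers have finished.
        while (!pool.isTerminated() || !queue.isEmpty()) {
            final String result = queue.poll();
            if (result == null) {
                TimeUnit.MILLISECONDS.sleep(10);   // nothing ready yet
            } else {
                process(result);
            }
        }
        zip.close();
    }

    private static String extract(InputStream in) throws IOException {
        return new String(in.readAllBytes());   // placeholder for real deserialization
    }

    private static void process(String s) {
        System.out.println(s.length());          // placeholder processing
    }
}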
Upvotes: 0
Reputation: 555
The right way to iterate a zip file
final ZipFile file = new ZipFile( FILE_NAME );
try
{
    final Enumeration<? extends ZipEntry> entries = file.entries();
    while ( entries.hasMoreElements() )
    {
        final ZipEntry entry = entries.nextElement();
        System.out.println( entry.getName() );
        //use entry input stream:
        readInputStream( file.getInputStream( entry ) );
    }
}
finally
{
    file.close();
}

private static int readInputStream( final InputStream is ) throws IOException {
    final byte[] buf = new byte[ 8192 ];
    int read = 0;
    int cntRead;
    while ( ( cntRead = is.read( buf, 0, buf.length ) ) >= 0 )
    {
        read += cntRead;
    }
    return read;
}
A zip file consists of several entries, each of which has a field containing the number of bytes in the current entry. So it is easy to iterate over all zip file entries without actually decompressing the data. java.util.zip.ZipFile accepts a file/file name and uses random access to jump between file positions. java.util.zip.ZipInputStream, on the other hand, works with streams, so it is unable to jump freely. That's why it has to read and decompress all zip data in order to reach the EOF of each entry and read the next entry header.
What does this mean? If you already have a zip file in your file system, use ZipFile to process it regardless of your task. As a bonus, you can access zip entries either sequentially or randomly (with a rather small performance penalty). On the other hand, if you are processing a stream, you'll need to process all entries sequentially using ZipInputStream.
Here is an example. A zip archive (total file size = 1.6 GB) containing three 0.6 GB entries was iterated in 0.05 sec using ZipFile and in 18 sec using ZipInputStream.
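For comparison, here is a minimal ZipInputStream version of the same iteration (the file name is a placeholder); this is the variant that must decompress every byte of each entry just to reach the next header:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipStreamIteration {
    public static void main(String[] args) throws IOException {
        final byte[] buf = new byte[8192];
        try (ZipInputStream zis =
                 new ZipInputStream(new FileInputStream("archive.zip"))) { // placeholder name
            ZipEntry entry;
            // getNextEntry() can only move forward: all compressed data of the
            // current entry must be read (and decompressed) before the next header.
            while ((entry = zis.getNextEntry()) != null) {
                long read = 0;
                int cnt;
                while ((cnt = zis.read(buf, 0, buf.length)) >= 0) {
                    read += cnt;
                }
                System.out.println(entry.getName() + ": " + read + " bytes");
            }
        }
    }
}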
Upvotes: 4
Reputation: 2515
Intel has made an improved version of zlib, which Java uses internally to perform zip/unzip. It requires you to patch the zlib sources with Intel's IPP patches. I made a benchmark showing 1.4x to 3x gains in throughput.
Upvotes: 0
Reputation: 139
You can try this code:
try
{
    final ZipFile zf = new ZipFile("C:/Documents and Settings/satheesh/Desktop/POTL.Zip");
    final Enumeration<? extends ZipEntry> entries = zf.entries();
    final File f2 = new File("output.txt"); // destination file (adjust as needed)
    String line;
    while (entries.hasMoreElements())
    {
        final ZipEntry zipEntry = entries.nextElement();
        final String fileName = zipEntry.getName();
        final InputStream inputs = zf.getInputStream(zipEntry);
        final BufferedReader br = new BufferedReader(new InputStreamReader(inputs, "UTF-8"));
        final BufferedWriter wr = new BufferedWriter(new FileWriter(f2));
        while ((line = br.readLine()) != null)
        {
            wr.write(line);
            System.out.println(line);
            wr.newLine();
            wr.flush();
        }
        wr.close();
        br.close();
    }
    zf.close();
}
catch (Exception e)
{
    System.out.print(e);
}
finally
{
    System.out.println("\n\n\nThe files have been extracted successfully");
}
This code works well.
Upvotes: 0
Reputation: 38122
You can use the new file API like this:
Path jarPath = Paths.get(...);
try (FileSystem jarFS = FileSystems.newFileSystem(jarPath, null)) {
    Path someFileInJarPath = jarFS.getPath("/...");
    try (ReadableByteChannel rbc = Files.newByteChannel(someFileInJarPath, EnumSet.of(StandardOpenOption.READ))) {
        // read file
    }
}
The code is for jar files, but since a jar is just a zip file, it works for zips as well.
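Since the goal here is to read the txt inside the zip line by line, the same zip file system can be combined with Files.newBufferedReader. A sketch with placeholder path and entry names (the (ClassLoader) null cast avoids an overload ambiguity on Java 13+):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ZipFsLineReader {
    public static void main(String[] args) throws IOException {
        Path zipPath = Paths.get("archive.zip");         // placeholder path
        try (FileSystem zipFS = FileSystems.newFileSystem(zipPath, (ClassLoader) null)) {
            Path txtInZip = zipFS.getPath("/data.txt");  // placeholder entry name
            try (BufferedReader reader =
                     Files.newBufferedReader(txtInZip, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // process the line
                }
            }
        }
    }
}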
Upvotes: 1
Reputation: 500167
I have a lot (thousands) of zip files. The zipped files are about 30 MB each, while the txt inside the zip file is about 60/70 MB. Reading and processing the files with this code takes a lot of hours, around 15, but it depends.
Let's do some back-of-the-envelope calculations.
Let's say you have 5000 files. If it takes 15 hours to process them, this equates to ~10 seconds per file. The files are about 30MB each, so the throughput is ~3MB/s.
This is between one and two orders of magnitude slower than the rate at which ZipFile can decompress stuff.
Either there's a problem with the disks (are they local, or a network share?), or it is the actual processing that is taking most of the time.
The best way to find out for sure is by using a profiler.
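Short of a full profiler run, a crude timing split can already indicate whether decompression or processing dominates. A rough sketch along the lines of the original loop, with the real work stubbed out as a hypothetical process(line) and a placeholder directory name:

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ThroughputCheck {
    public static void main(String[] args) throws IOException {
        long readNanos = 0, processNanos = 0, bytes = 0;
        for (File f : new File("zips").listFiles()) {    // placeholder directory
            try (ZipFile zf = new ZipFile(f)) {
                ZipEntry ze = zf.entries().nextElement();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(zf.getInputStream(ze)))) {
                    String line;
                    while (true) {
                        long t0 = System.nanoTime();
                        line = in.readLine();            // I/O + decompression
                        readNanos += System.nanoTime() - t0;
                        if (line == null) break;
                        bytes += line.length();
                        long t1 = System.nanoTime();
                        process(line);                   // the actual work
                        processNanos += System.nanoTime() - t1;
                    }
                }
            }
        }
        System.out.printf("read: %ds, process: %ds, ~%.1f MB/s raw read%n",
                readNanos / 1_000_000_000L, processNanos / 1_000_000_000L,
                bytes / (readNanos / 1e9) / 1e6);
    }

    private static void process(String line) { /* placeholder processing */ }
}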
Upvotes: 3