Fedearne

Reputation: 7348

Best way to detect if a stream is zipped in Java

What is the best way to find out if a java.io.InputStream contains zipped data?

Upvotes: 24

Views: 25355

Answers (7)

k3b

Reputation: 14755

I combined the answers from @McDowell and @Innokenty into a small library function that you can paste into your project:

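import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

public static boolean isZipStream(InputStream inputStream) {
    if (inputStream == null || !inputStream.markSupported()) {
        throw new IllegalArgumentException("InputStream must support mark/reset. Wrap it in a BufferedInputStream.");
    }
    boolean isZipped = false;
    try {
        inputStream.mark(2048);
        // Do not close the ZipInputStream: that would also close the caller's stream.
        isZipped = new ZipInputStream(inputStream).getNextEntry() != null;
    } catch (IOException ex) {
        // Cannot be opened as a zip.
    } finally {
        try {
            inputStream.reset(); // rewind even after a failure so the caller can reuse the stream
        } catch (IOException ignored) {
            // The mark was invalidated; the stream cannot be rewound.
        }
    }
    return isZipped;
}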

You can use the lib like this:

public static void main(String[] args) {
    InputStream inputStream = new BufferedInputStream(...);

    if (isZipStream(inputStream)) {
        // do zip processing using inputStream
    } else {
        // do non-zip processing using inputStream
    }

}

Upvotes: 0

Stone

Reputation: 673

Since .zip and .xlsx files have the same magic number, checking it could not tell me whether a file was really a zip file (for example, after it had been renamed).

So I used Apache Tika to detect the exact document type.

Even if the file has been renamed to .zip, Tika still reports its actual type.

Reference: https://www.baeldung.com/apache-tika

Upvotes: 0

kk nair

Reputation: 41

Checking the magic number may not be the right option.

Docx files also start with the same magic number, 50 4B 03 04.
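One way to tell the two apart with only the JDK is to look at the entry names instead of the magic number. The sketch below is not taken from any of the answers here: it assumes that Office Open XML packages (.docx, .xlsx and so on) contain a root entry named [Content_Types].xml, and it does not try to tell a .docx from an .xlsx. The names ZipKindDemo, Kind, classify and zipWith are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipKindDemo {

    enum Kind { NOT_ZIP, PLAIN_ZIP, OFFICE_OPEN_XML }

    // Walks the entries and looks for the OOXML marker entry.
    // Consumes the stream; wrap it and use mark/reset if you need it afterwards.
    static Kind classify(InputStream in) throws IOException {
        ZipInputStream zis = new ZipInputStream(in);
        boolean sawEntry = false;
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            sawEntry = true;
            // Assumption: this entry is present in every OOXML package.
            if ("[Content_Types].xml".equals(entry.getName())) {
                return Kind.OFFICE_OPEN_XML;
            }
        }
        return sawEntry ? Kind.PLAIN_ZIP : Kind.NOT_ZIP;
    }

    // Hypothetical helper: builds an in-memory archive with the given entry names.
    static byte[] zipWith(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write("x".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(classify(new ByteArrayInputStream(zipWith("a.txt"))));
        System.out.println(classify(new ByteArrayInputStream(
                zipWith("[Content_Types].xml", "word/document.xml"))));
    }
}
```

Other zip-based formats (jar, odt, epub) would still be reported as plain zips here, so a detection library remains the safer choice when the exact type matters.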

Upvotes: 0

Innokenty

Reputation: 3293

Introduction

Since all the answers are 5 years old, I feel a duty to write down what's going on today. I seriously doubt one should read the magic bytes of the stream! That's low-level code, and it should be avoided in general.

Simple answer

miku writes:

If the Stream can be read via ZipInputStream, it should be zipped.

Yes, but in the case of ZipInputStream, "can be read" means that the first call to .getNextEntry() returns a non-null value. No exception catching et cetera. So instead of parsing magic bytes you can just do:

boolean isZipped = new ZipInputStream(yourInputStream).getNextEntry() != null;

And that's it!
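To try the one-liner without any files, here is a minimal, self-contained sketch. The class name ZipProbeDemo and the helper sampleZip are made up for the demo; only the getNextEntry() check comes from the line above. Note that the check consumes bytes from the stream, so wrap the stream and use mark/reset if you need to read it again afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipProbeDemo {

    // True if a first entry header can be read from the stream.
    static boolean isZipped(InputStream in) throws IOException {
        return new ZipInputStream(in).getNextEntry() != null;
    }

    // Hypothetical helper: builds a one-entry archive in memory.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            zos.putNextEntry(new ZipEntry("HelloWorld.txt"));
            zos.write("Hello world".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] text = "PK is not enough on its own, this is just plain text"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(isZipped(new ByteArrayInputStream(sampleZip()))); // true
        System.out.println(isZipped(new ByteArrayInputStream(text)));        // false
    }
}
```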

General unzipping thoughts

In general, I found it much more convenient to work with files than with streams when [un]zipping. There are several useful libraries, and ZipFile has more functionality than ZipInputStream. Handling of zip files is discussed here: What is a good Java library to zip/unzip files? So if you can work with files, you'd better do so!

Code sample

In my application I could only work with streams, so this is the method I wrote for unzipping:

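import org.apache.commons.io.IOUtils;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public boolean unzip(InputStream inputStream, File outputFolder) throws IOException {
    boolean isEmpty = true;
    // try-with-resources closes the zip stream even if extraction fails
    try (ZipInputStream zis = new ZipInputStream(inputStream)) {
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            isEmpty = false;
            File newFile = new File(outputFolder, entry.getName());
            // Reject entries such as "../x" that would land outside outputFolder
            if (!newFile.getCanonicalPath().startsWith(outputFolder.getCanonicalPath() + File.separator)) {
                throw new IOException("Entry is outside of the target folder: " + entry.getName());
            }
            if (entry.isDirectory()) {
                newFile.mkdirs();
                continue;
            }
            // mkdirs() returns false when the folder already exists, so its result must not gate the copy
            newFile.getParentFile().mkdirs();
            try (OutputStream fos = new FileOutputStream(newFile)) {
                IOUtils.copy(zis, fos);
            }
        }
    }
    return !isEmpty;
}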

Upvotes: 47

miku

Reputation: 188014

Not very elegant, but reliable:

If the Stream can be read via ZipInputStream, it should be zipped.

Upvotes: 6

David Webb

Reputation: 193696

You could check that the first four bytes of the stream are the local file header signature. A local file header precedes every file in a ZIP file, and the spec gives its signature as 50 4B 03 04.

A little test code shows this to work:

byte[] buffer = new byte[4];

try {
    ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("so.zip"));
    ZipEntry ze = new ZipEntry("HelloWorld.txt");
    zos.putNextEntry(ze);
    zos.write("Hello world".getBytes());
    zos.close();

    FileInputStream is = new FileInputStream("so.zip");
    is.read(buffer);
    is.close();
}
catch(IOException e) {
    e.printStackTrace();
}

// %H prints each byte's value in hex without zero padding, so 0x03 shows up as "3"
for (byte b : buffer) {
    System.out.printf("%H ",b);
}

Gave me this output:

50 4B 3 4 

Upvotes: 6

McDowell

Reputation: 108879

The magic bytes for the ZIP format are 50 4B. You could test the stream (using mark and reset - you may need to buffer) but I wouldn't expect this to be a 100% reliable approach. There would be no way to distinguish it from a US-ASCII encoded text file that began with the letters PK.
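A minimal sketch of such a mark/reset test is shown below. It compares against the four-byte local file header signature 50 4B 03 04 and not just the leading 50 4B, so a text file starting with "PK" is only mistaken for a zip if its next two bytes also match. The class and method names (ZipMagicDemo, startsWithZipSignature) are made up, and the caveat about reliability still applies: a matching signature does not prove the rest of the stream is a valid archive.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ZipMagicDemo {

    // Local file header signature that starts a non-empty ZIP archive.
    private static final byte[] LOCAL_FILE_HEADER = {0x50, 0x4B, 0x03, 0x04};

    // Peeks at the first four bytes and rewinds, leaving the stream unread.
    static boolean startsWithZipSignature(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("Wrap the stream in a BufferedInputStream first");
        }
        in.mark(LOCAL_FILE_HEADER.length);
        try {
            byte[] head = new byte[LOCAL_FILE_HEADER.length];
            int read = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            return read == head.length && Arrays.equals(head, LOCAL_FILE_HEADER);
        } finally {
            in.reset(); // rewind so the caller sees the stream from the start
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        byte[] text = "PK text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithZipSignature(
                new BufferedInputStream(new ByteArrayInputStream(zipLike)))); // true
        System.out.println(startsWithZipSignature(
                new BufferedInputStream(new ByteArrayInputStream(text))));    // false
    }
}
```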

The best way would be to provide metadata on the content format prior to opening the stream and then treat it appropriately.

Upvotes: 24
