Reputation: 7348
What is the best way to find out i java.io.InputStream
contains zipped data?
Upvotes: 24
Views: 25355
Reputation: 14755
I combined answers from @McDowell and @Innokenty to a small lib function that you can paste into you project:
public static boolean isZipStream(InputStream inputStream) {
if (inputStream == null || !inputStream.markSupported()) {
throw new IllegalArgumentException("InputStream must support mark-reset. Use BufferedInputstream()");
}
boolean isZipped = false;
try {
inputStream.mark(2048);
isZipped = new ZipInputStream(inputStream).getNextEntry() != null;
inputStream.reset();
} catch (IOException ex) {
// cannot be opend as zip.
}
return isZipped;
}
You can use the lib like this:
public static void main(String[] args) {
InputStream inputStream = new BufferedInputStream(...);
if (isZipStream(inputStream)) {
// do zip processing using inputStream
} else {
// do non-zip processing using inputStream
}
}
Upvotes: 0
Reputation: 673
Since both .zip and .xlsx having the same Magic number, I couldn't find the valid zip file (if renamed).
So, I have used Apache Tika to find the exact document type.
Even if renamed the file type as zip, it finds the exact type.
Reference: https://www.baeldung.com/apache-tika
Upvotes: 0
Reputation: 41
Checking the magic number may not be the right option.
Docx files are also having similar magic number 50 4B 3 4
Upvotes: 0
Reputation: 3293
Introduction
Since all the answers are 5 years old I feel a duty to write down, what's going on today. I seriously doubt one should read magic bytes of the stream! That's a low level code, it should be avoided in general.
Simple answer
miku writes:
If the Stream can be read via ZipInputStream, it should be zipped.
Yes, but in case of ZipInputStream
"can be read" means that first call to .getNextEntry()
returns a non-null value. No exception catching et cetera. So instead of magic bytes parsing you can just do:
boolean isZipped = new ZipInputStream(yourInputStream).getNextEntry() != null;
And that's it!
General unzipping thoughts
In general, it appeared that it's much more convenient to work with files while [un]zipping, than with streams. There are several useful libraries, plus ZipFile has got more functionality than ZipInputStream. Handling of zip files is discussed here: What is a good Java library to zip/unzip files? So if you can work with files you better do!
Code sample
I needed in my application to work with streams only. So that's the method I wrote for unzipping:
import org.apache.commons.io.IOUtils;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
public boolean unzip(InputStream inputStream, File outputFolder) throws IOException {
ZipInputStream zis = new ZipInputStream(inputStream);
ZipEntry entry;
boolean isEmpty = true;
while ((entry = zis.getNextEntry()) != null) {
isEmpty = false;
File newFile = new File(outputFolder, entry.getName());
if (newFile.getParentFile().mkdirs() && !entry.isDirectory()) {
FileOutputStream fos = new FileOutputStream(newFile);
IOUtils.copy(zis, fos);
IOUtils.closeQuietly(fos);
}
}
IOUtils.closeQuietly(zis);
return !isEmpty;
}
Upvotes: 47
Reputation: 188014
Not very elegant, but reliable:
If the Stream can be read via ZipInputStream
, it should be zipped.
Upvotes: 6
Reputation: 193696
You could check that the first four bytes of the stream are the local file header signature that starts the local file header that proceeds every file in a ZIP file, as shown in the spec here to be 50 4B 03 04
.
A little test code shows this to work:
byte[] buffer = new byte[4];
try {
ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("so.zip"));
ZipEntry ze = new ZipEntry("HelloWorld.txt");
zos.putNextEntry(ze);
zos.write("Hello world".getBytes());
zos.close();
FileInputStream is = new FileInputStream("so.zip");
is.read(buffer);
is.close();
}
catch(IOException e) {
e.printStackTrace();
}
for (byte b : buffer) {
System.out.printf("%H ",b);
}
Gave me this output:
50 4B 3 4
Upvotes: 6
Reputation: 108879
The magic bytes for the ZIP format are 50 4B
. You could test the stream (using mark and reset - you may need to buffer) but I wouldn't expect this to be a 100% reliable approach. There would be no way to distinguish it from a US-ASCII encoded text file that began with the letters PK
.
The best way would be to provide metadata on the content format prior to opening the stream and then treat it appropriately.
Upvotes: 24