Reputation: 612
I have a zip archive that contains several gzip files, but the gzip files also have the .zip extension. I walk through the zip archive with ZipInputStream. How can I detect an inner file's type by reading its content rather than relying on its extension? I also need to do this without changing (or resetting) the ZipInputStream position.
Example:
root/1.zip/2.zip/3.zip(actually 3 is gzip)/4.txt
Sample Java Code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import org.apache.tika.Tika;

public static void main(String[] args) {
    // root/1.zip/2.zip/3.zip(actually 3 is gzip)/4.txt
    String file = "root/1.zip";
    File rootZip = new File(file);
    try (FileInputStream fis = new FileInputStream(rootZip)) {
        lookupInZip(fis)
            .stream()
            .forEach(System.out::println);
    } catch (IOException e) {
        System.out.println("Failed to get files");
    }
}

public static List<String> lookupInZip(InputStream inputStream) throws IOException {
    Tika tika = new Tika();
    List<String> paths = new ArrayList<>();
    ZipInputStream zipInputStream = new ZipInputStream(inputStream);
    ZipEntry entry = zipInputStream.getNextEntry();
    while (entry != null) {
        String entryName = entry.getName();
        if (!entry.isDirectory()) {
            // Option 1: detect from the entry name (extension only)
            //String fileType = tika.detect(entryName);
            // Option 2: detect from the entry content (consumes bytes from the stream)
            String fileType = tika.detect(zipInputStream);
            if ("application/zip".equals(fileType)) {
                List<String> innerPaths = lookupInZip(zipInputStream);
                paths.addAll(innerPaths);
            } else {
                paths.add(entryName);
            }
        }
        entry = zipInputStream.getNextEntry();
    }
    return paths;
}
If I use option 1, '3.zip' is evaluated as a zip file, but it is actually gzip. If I use option 2, '2.zip' is correctly identified as zip from its content. But when lookupInZip() is called recursively for '3.zip', zipInputStream.getNextEntry() returns null, because in the previous step the inputStream content was read to detect the type, so the inputStream position has changed.
Note: tika.detect() uses a BufferedInputStream internally to reset the inputStream position, but that does not solve my problem.
Upvotes: 0
Views: 1577
Reputation: 112597
The first two bytes are enough to see if it is likely a zip file, likely a gzip file, or certainly something else.
If the first two bytes are 0x50 0x4b, then it is likely a zip file. If the first two bytes are 0x1f 0x8b, then it is likely a gzip file. If it is neither, then the file is something else.
A match on the first two bytes does not guarantee that the file is of that type, but it appears from your structure that it is usually one or the other, and you can use the extension as further corroborating evidence that it is compressed.
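The two-byte check described above could be sketched roughly as follows. The class name MagicBytes, the Kind enum, and the classify method are hypothetical names chosen for illustration, not part of any library, and the result is only a "likely" verdict.

```java
public final class MagicBytes {

    /** Likely container type, judged from the first two bytes only. */
    public enum Kind { ZIP, GZIP, OTHER }

    /**
     * Classifies a header buffer. `length` is how many bytes of `head`
     * were actually read; fewer than two means we cannot tell.
     * (Hypothetical helper, not a library API.)
     */
    public static Kind classify(byte[] head, int length) {
        if (length < 2) {
            return Kind.OTHER;
        }
        // Mask to avoid sign extension: Java bytes are signed.
        int b0 = head[0] & 0xff;
        int b1 = head[1] & 0xff;
        if (b0 == 0x50 && b1 == 0x4b) {   // "PK"
            return Kind.ZIP;
        }
        if (b0 == 0x1f && b1 == 0x8b) {   // gzip magic number
            return Kind.GZIP;
        }
        return Kind.OTHER;
    }
}
```

The `& 0xff` mask matters because 0x8b does not fit in a signed Java byte, so comparing the raw byte against 0x8b would never match.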
As for not changing the position, you need a way to peek at the first two bytes without advancing the position, or a way to get them and then unget them to return the position to where it was.
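One way to "get them and then unget them" in Java is java.io.PushbackInputStream, which allows bytes that were read to be pushed back. The following is a minimal sketch under that assumption, not a definitive implementation: the class NestedZipWalker and the method peekKind are hypothetical names, and tagging gzip entries with " (gzip)" is just an illustrative choice for what to do once the type is known.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public final class NestedZipWalker {

    /** Reads up to two bytes, pushes them back, and reports the likely type. */
    public static String peekKind(PushbackInputStream in) throws IOException {
        byte[] head = new byte[2];
        int n = 0;
        while (n < 2) {
            int r = in.read(head, n, 2 - n);
            if (r < 0) {
                break;              // entry shorter than two bytes
            }
            n += r;
        }
        if (n > 0) {
            in.unread(head, 0, n);  // restore the bytes for the next reader
        }
        if (n < 2) {
            return "other";
        }
        int b0 = head[0] & 0xff;
        int b1 = head[1] & 0xff;
        if (b0 == 0x50 && b1 == 0x4b) {
            return "zip";
        }
        if (b0 == 0x1f && b1 == 0x8b) {
            return "gzip";
        }
        return "other";
    }

    /** Lists non-zip entries, recursing into entries whose content looks like zip. */
    public static List<String> lookupInZip(InputStream inputStream) throws IOException {
        List<String> paths = new ArrayList<>();
        // Not closed on purpose: closing would also close the enclosing stream.
        ZipInputStream zip = new ZipInputStream(inputStream);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            // Pushback capacity of 2 is enough for the two peeked bytes.
            PushbackInputStream content = new PushbackInputStream(zip, 2);
            String kind = peekKind(content);
            if ("zip".equals(kind)) {
                // Recurse on the wrapper, which still holds the peeked bytes.
                paths.addAll(lookupInZip(content));
            } else if ("gzip".equals(kind)) {
                paths.add(entry.getName() + " (gzip)");
            } else {
                paths.add(entry.getName());
            }
        }
        return paths;
    }
}
```

The important detail is that the recursive call receives the PushbackInputStream wrapper rather than the original ZipInputStream, since only the wrapper can replay the two bytes that were peeked. The default PushbackInputStream capacity is a single byte, hence the explicit size of 2.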
Upvotes: 1