Reputation: 305
I'm trying to find the start of the compressed data for each zip entry using zip4j. It's a great library for returning the local header offset, which Java's ZipFile does not expose. However, I'm wondering whether there is a more reliable way than what I'm doing below to get the start of the compressed data. Thanks in advance.
offset = header.getOffsetLocalHeader();
offset += 30; //add fixed file header
offset += header.getFileNameLength(); // add filename field length
offset += header.getExtraFieldLength(); //add extra field length
//not quite the right number, sometimes have to add 4
//seems to be some header data that is outside the extra field value
offset += 4;
Edit: Here is a sample zip: https://alexa-public.s3.amazonaws.com/test.zip
The code below decompresses each item properly but won't work without the +4.
String path = "/Users/test/Desktop/zip test/test.zip";
List<FileHeader> fileHeaders = new ZipFile(path).getFileHeaders();
for (FileHeader header : fileHeaders) {
    long offset = 30 + header.getOffsetLocalHeader() + header.getFileNameLength() + header.getExtraFieldLength();
    //fudge factor!
    offset += 4;
    RandomAccessFile f = new RandomAccessFile(path, "r");
    byte[] buffer = new byte[(int) header.getCompressedSize()];
    f.seek(offset);
    f.read(buffer, 0, (int) header.getCompressedSize());
    f.close();
    Inflater inf = new Inflater(true);
    inf.setInput(buffer);
    byte[] inflatedContent = new byte[(int) header.getUncompressedSize()];
    inf.inflate(inflatedContent);
    inf.end();
    FileOutputStream fos = new FileOutputStream(new File("/Users/test/Desktop/" + header.getFileName()));
    fos.write(inflatedContent);
    fos.close();
}
Upvotes: 1
Views: 1125
Reputation: 1488
We found a more reliable library, Apache Commons Compress, which returns the data offset directly (via ZipArchiveEntry.getDataOffset()) rather than just the local file header offset, so no manual calculation is needed. The solution we are developing works in two stages: the first indexes all files inside the zip file, and the second fetches only the required files using S3 range requests. This solution works for both ZIP and ZIP64 formats.
import org.apache.commons.compress.archivers.zip.ZipFile;
import org.apache.commons.compress.utils.IOUtils;
import org.apache.commons.io.FileUtils;
import software.amazon.awssdk.auth.credentials.AwsSessionCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.LinkedList;
import java.util.List;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

public class ZipExtractor5 {

    final static String ACCESS_KEY = "";
    final static String SECRET_KEY = "";
    final static String SESSION_TOKEN = "";
    final static String BUCKET = "my-bucket";
    final static String OBJECT_KEY = "input_folder/data.zip";
    public static final String OUTPUT_FOLDER = "/home/output_folder/";
    final static String ZIP_FILE = "/home/input_folder/data.zip";

    public static void main(String[] args) throws Exception {
        // INDEXING PHASE
        List<ZipEntity> zipEntities = new LinkedList<>();
        try (ZipFile zipFile = new ZipFile(ZIP_FILE)) {
            var entries = zipFile.getEntries();
            while (entries.hasMoreElements()) {
                var entry = entries.nextElement();
                ZipEntity zip = new ZipEntity();
                zip.name = entry.getName();
                zip.directory = entry.isDirectory();
                zip.dataOffset = entry.getDataOffset();
                zip.compressedSize = entry.getCompressedSize();
                zip.size = entry.getSize();
                zipEntities.add(zip);
                // ZipEntities could be indexed and used by
                // other applications to know which segments
                // of the main file to fetch
            }
        }

        // FETCHING PHASE
        for (ZipEntity zipEntry : zipEntities) {
            if (!zipEntry.directory) {
                long offset = zipEntry.dataOffset;
                // HTTP Range requests are inclusive on both ends, so the last
                // byte of the entry is at offset + compressedSize - 1
                long end = offset + zipEntry.compressedSize - 1;
                byte[] data = readFileRange(OBJECT_KEY, offset, end);
                InputStream inputStream;
                if (zipEntry.compressedSize == zipEntry.size) {
                    // entry is STORED (no compression)
                    inputStream = new ByteArrayInputStream(data);
                } else {
                    // entry is DEFLATED; raw deflate stream, hence Inflater(true)
                    inputStream = new InflaterInputStream(new ByteArrayInputStream(data), new Inflater(true));
                }
                File outputFile = new File(OUTPUT_FOLDER + zipEntry.name);
                Files.deleteIfExists(outputFile.toPath());
                FileUtils.copyInputStreamToFile(inputStream, outputFile);
            }
        }
    }

    public static byte[] readFileRange(String filename, long start, long end) throws Exception {
        S3Client s3Client = S3Client.builder()
                .credentialsProvider(StaticCredentialsProvider
                        .create(AwsSessionCredentials.create(ACCESS_KEY, SECRET_KEY, SESSION_TOKEN)))
                .region(Region.US_WEST_2)
                .build();
        return IOUtils.toByteArray(s3Client.getObject(
                GetObjectRequest.builder()
                        .bucket(BUCKET)
                        .key(filename)
                        .range("bytes=%d-%d".formatted(start, end))
                        .build()));
    }

    public static class ZipEntity {
        public long compressedSize;
        public long dataOffset;
        public String name;
        public boolean directory;
        public long size;
    }
}
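One detail worth double-checking in any range-fetch approach: HTTP Range headers are inclusive on both ends, so an end of offset + compressedSize fetches one byte too many. A minimal sketch of the arithmetic (the rangeFor helper is illustrative, not part of any library):

```java
public class RangeHeaderDemo {
    // HTTP Range headers (RFC 9110) are inclusive on both ends:
    // "bytes=0-99" is exactly 100 bytes. To fetch `length` bytes
    // starting at `offset`, the end must be offset + length - 1.
    static String rangeFor(long offset, long length) {
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    public static void main(String[] args) {
        System.out.println(rangeFor(0, 100));  // bytes=0-99
        System.out.println(rangeFor(512, 16)); // bytes=512-527
    }
}
```

The extra byte is harmless to an inflater that stops at the end of the deflate stream, but getting the bounds exact avoids surprises with stored (uncompressed) entries.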
Upvotes: 0
Reputation: 1956
The reason you have to add 4 to the offset in your example is that the size of the extra data field in the central directory for this entry (= file header) differs from the one in the local file header, and it is perfectly legal per the zip specification to have different extra data field sizes in the central directory and the local header. In fact, the extra data field in question, the Extended Timestamp extra field (signature 0x5455), has an official definition with different lengths in the two places.
Extended Timestamp extra field (signature 0x5455)
Local-header version:
| Value | Size | Description |
| ------------- |---------------|---------------------------------------|
| 0x5455 | Short | tag for this extra block type ("UT") |
| TSize | Short | total data size for this block |
| Flags | Byte | info bits |
| (ModTime) | Long | time of last modification (UTC/GMT) |
| (AcTime) | Long | time of last access (UTC/GMT) |
| (CrTime) | Long | time of original creation (UTC/GMT) |
Central-header version:
| Value | Size | Description |
| ------------- |---------------|---------------------------------------|
| 0x5455 | Short | tag for this extra block type ("UT") |
| TSize | Short | total data size for this block |
| Flags | Byte | info bits |
| (ModTime) | Long | time of last modification (UTC/GMT) |
In the sample zip file you attached, the tool that created it wrote 4 more bytes for this extra field in the local header than in the central directory.
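To see this concretely, the extra-field area is a sequence of [tag: 2 bytes][size: 2 bytes][data: size bytes] records, all little-endian. A small sketch that walks those records; the sample byte arrays mimic the local (9-byte) and central (5-byte) 0x5455 blocks from the tables above:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ExtraFieldWalker {
    /**
     * Returns the data size of the extra block with the given tag,
     * or -1 if the tag is not present. Extra fields are a sequence of
     * [tag: 2 bytes][size: 2 bytes][data: size bytes], little-endian.
     */
    static int blockSize(byte[] extraField, int wantedTag) {
        ByteBuffer buf = ByteBuffer.wrap(extraField).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int tag = buf.getShort() & 0xFFFF;
            int size = buf.getShort() & 0xFFFF;
            if (tag == wantedTag) {
                return size;
            }
            buf.position(buf.position() + size); // skip this block's data
        }
        return -1;
    }

    public static void main(String[] args) {
        // Local-header style 0x5455 block: flags + mod time + access time = 9 bytes
        byte[] local = {0x55, 0x54, 0x09, 0x00, 0x03, 1, 2, 3, 4, 5, 6, 7, 8};
        // Central-header style 0x5455 block: flags + mod time only = 5 bytes
        byte[] central = {0x55, 0x54, 0x05, 0x00, 0x03, 1, 2, 3, 4};
        System.out.println(blockSize(local, 0x5455));   // 9
        System.out.println(blockSize(central, 0x5455)); // 5
    }
}
```

The 4-byte difference between the two blocks (the access-time field) is exactly the fudge factor in the question.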
Relying on the extra field length from the central directory to reach the start of data is therefore error prone. A more reliable way to achieve what you want is to read the extra field length from the local header itself. I have modified your code slightly to read the extra field length from the local header instead of the central directory:
import net.lingala.zip4j.ZipFile;
import net.lingala.zip4j.model.FileHeader;
import net.lingala.zip4j.util.RawIO;
import org.junit.Test;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class ZipTest {

    private static final int OFFSET_TO_EXTRA_FIELD_LENGTH_SIZE = 28;

    private RawIO rawIO = new RawIO();

    @Test
    public void testExtractWithDataOffset() throws IOException, DataFormatException {
        String basePath = "/Users/slingala/Downloads/test/";
        String path = basePath + "test.zip";
        List<FileHeader> fileHeaders = new ZipFile(path).getFileHeaders();

        for (FileHeader header : fileHeaders) {
            RandomAccessFile f = new RandomAccessFile(path, "r");
            byte[] buffer = new byte[(int) header.getCompressedSize()];
            // The extra field length sits 28 bytes into this entry's local header
            f.seek(header.getOffsetLocalHeader() + OFFSET_TO_EXTRA_FIELD_LENGTH_SIZE);
            int extraFieldLength = rawIO.readShortLittleEndian(f);
            // Skip the filename and the local header's extra field to reach the data
            f.skipBytes(header.getFileNameLength() + extraFieldLength);
            f.read(buffer, 0, (int) header.getCompressedSize());
            f.close();

            Inflater inf = new Inflater(true);
            inf.setInput(buffer);
            byte[] inflatedContent = new byte[(int) header.getUncompressedSize()];
            inf.inflate(inflatedContent);
            inf.end();

            FileOutputStream fos = new FileOutputStream(new File(basePath + header.getFileName()));
            fos.write(inflatedContent);
            fos.close();
        }
    }
}
On a side note, I wonder why you want to read the data, deal with the inflater and extract the content yourself? With zip4j you can extract all entries with ZipFile.extractAll(), or you can extract each entry in the zip file with streams via ZipFile.getInputStream(). A skeleton example is:
ZipFile zipFile = new ZipFile("filename.zip");
FileHeader fileHeader = zipFile.getFileHeader("entry_name_in_zip.txt");
InputStream inputStream = zipFile.getInputStream(fileHeader);
Once you have the input stream, you can read the content and write it to any output stream. This way you can extract each entry in the zip file without having to deal with the inflaters yourself.
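The same pattern also works with the JDK's built-in java.util.zip.ZipFile, which likewise hides the inflater behind getInputStream. A self-contained sketch (it writes a throwaway zip first so it runs standalone; entry and file names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class StreamExtractDemo {
    static String readEntry(Path zip, String name) throws Exception {
        // getInputStream hands back already-inflated data
        try (ZipFile zipFile = new ZipFile(zip.toFile());
             InputStream in = zipFile.getInputStream(zipFile.getEntry(name))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            in.transferTo(out); // plain stream copy, no Inflater handling
            return out.toString(StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a small zip so the example is self-contained
        Path zip = Files.createTempFile("demo", ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zip))) {
            zos.putNextEntry(new ZipEntry("entry_name_in_zip.txt"));
            zos.write("hello zip".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        System.out.println(readEntry(zip, "entry_name_in_zip.txt")); // hello zip
        Files.deleteIfExists(zip);
    }
}
```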
Upvotes: 4