Reputation: 2549
In my PDF, there is an object
<</Filter/FlateDecode/First 721/Length 3424/N 79/Type/ObjStm>>stream
The raw data on the next line starts with the bytes
eKoq..., or precisely [101, 75, 111, 113, 22, 229, 156, 253, 116, ...
My Flate decoder fails on this input. How should it be processed?
http://s000.tinyupload.com/?file_id=25511328881895019912
Upvotes: 4
Views: 21893
Reputation: 4883
This PDF is encrypted. The PDF file trailer is:
endobj
startxref
116
%%EOF
The cross-reference stream at byte offset 116 (with some formatting) is:
<</DecodeParms<</Columns 5/Predictor 12>>
/Encrypt 389 0 R
% ... etc
/Type/XRef /W[1 3 1]
>> stream
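As a rough illustration of how the entries in such a stream are read: /Predictor 12 is the PNG "Up" predictor, so after inflating, each row carries one filter-type byte followed by /Columns data bytes, and /W [1 3 1] splits every 5-byte row into a 1-byte type, a 3-byte field and a 1-byte field. The Python sketch below uses function names made up for this example and handles only the None and Up row filters; it is not a complete predictor implementation.

```python
def undo_png_predictor(data: bytes, columns: int) -> bytes:
    """Reverse PNG row filtering (only filter types 0 = None and 2 = Up)."""
    row_len = columns + 1  # one leading filter-type byte per row
    if len(data) % row_len:
        raise ValueError("data is not a whole number of rows")
    prev = bytearray(columns)  # the row above the first row is all zeros
    out = bytearray()
    for start in range(0, len(data), row_len):
        filter_type = data[start]
        row = bytearray(data[start + 1:start + row_len])
        if filter_type == 2:
            # Up: each byte was stored as a difference from the byte above it.
            for i in range(columns):
                row[i] = (row[i] + prev[i]) & 0xFF
        elif filter_type != 0:
            raise ValueError(f"unsupported PNG filter type {filter_type}")
        out += row
        prev = row
    return bytes(out)


def split_xref_rows(rows: bytes, widths=(1, 3, 1)):
    """Split de-predicted rows into big-endian integer fields per /W."""
    size = sum(widths)
    entries = []
    for start in range(0, len(rows), size):
        pos, fields = start, []
        for w in widths:
            fields.append(int.from_bytes(rows[pos:pos + w], "big"))
            pos += w
        entries.append(tuple(fields))
    return entries
```

An entry whose first field is 2 refers to an object stored inside an object stream: the second field is the object number of that stream and the third is the index within it.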
The encryption dictionary 389 0 R (formatted) is:
389 0 obj <<
/CF <<
/StdCF <<
/AuthEvent /DocOpen
/CFM /AESV2
/Length 16
>>
>>
/EncryptMetadata false
/Filter /Standard
/O (...) % binary owner key
/P -1084
/R 4
/StmF /StdCF
/StrF /StdCF
/U (...) % binary user key
/V 4
/Length 128
>>
endobj
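For orientation, the file encryption key for a /Filter /Standard dictionary with /R 4 is derived from the password, the /O string, /P and the first /ID element using MD5. The sketch below follows that derivation as I understand it from the specification; the function name and parameters are invented for this example, and the /O and /ID values have to be read from the actual file since they are elided above. It does not verify the password against /U.

```python
import hashlib
import struct

# Standard 32-byte password padding string from the PDF specification.
PAD = bytes.fromhex(
    "28BF4E5E4E758A4164004E56FFFA0108"
    "2E2E00B6D0683E802F0CA9FE6453697A"
)


def compute_file_key(password: bytes, o_entry: bytes, p: int, id0: bytes,
                     key_len: int = 16, revision: int = 4,
                     encrypt_metadata: bool = True) -> bytes:
    """Derive the file encryption key (hypothetical helper, revisions 2-4)."""
    md5 = hashlib.md5()
    md5.update((password + PAD)[:32])   # pad or truncate password to 32 bytes
    md5.update(o_entry)                 # /O string from the Encrypt dictionary
    md5.update(struct.pack("<i", p))    # /P as 4 bytes, low-order byte first
    md5.update(id0)                     # first element of the trailer /ID array
    if revision >= 4 and not encrypt_metadata:
        md5.update(b"\xff\xff\xff\xff")
    digest = md5.digest()
    if revision >= 3:
        for _ in range(50):
            digest = hashlib.md5(digest[:key_len]).digest()
    return digest[:key_len]
```

With the dictionary above, the call would use `p=-1084`, `key_len=16` (from /Length 128) and `encrypt_metadata=False`, with an empty password for a file that opens without prompting.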
ISO 32000-1 (PDF 32000) states:
7.6.1 General A PDF document can be encrypted (PDF 1.1) to protect its contents from unauthorized access. Encryption applies to all strings and streams in the document's PDF file, with the following exceptions:
• The values for the ID entry in the trailer
• Any strings in an Encrypt dictionary
• Any strings that are inside streams such as content streams and compressed object streams, which themselves are encrypted
The referenced object is an object stream (/Type /ObjStm) in an encrypted PDF. To process it, you need to implement decryption (AESV2 in this case, i.e. AES-128 in CBC mode) and decrypt each stream before applying its other filters, such as FlateDecode.
Note: this PDF is encrypted with a blank user password, so it opens in most viewers without the need to enter a user password.
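To make the "decrypt before applying other filters" step more concrete, here is a minimal Python sketch of the two parts that need only the standard library: deriving the per-object key for AESV2 and separating the initialization vector from the ciphertext. The function names are invented for this example, and `file_key` is assumed to have been derived already from the Encrypt dictionary. The AES-CBC decryption itself is left out because the Python standard library has no AES implementation.

```python
import hashlib


def object_key_aesv2(file_key: bytes, obj_num: int, gen_num: int) -> bytes:
    """Per-object key: MD5(file key + 3 bytes obj no. + 2 bytes gen no. + 'sAlT')."""
    material = (file_key
                + obj_num.to_bytes(4, "little")[:3]   # low-order 3 bytes
                + gen_num.to_bytes(4, "little")[:2]   # low-order 2 bytes
                + b"sAlT")                            # extra salt used for AES only
    return hashlib.md5(material).digest()[:min(len(file_key) + 5, 16)]


def split_iv(stream_data: bytes):
    """AESV2 stream data starts with a 16-byte IV, followed by CBC ciphertext."""
    if len(stream_data) < 32 or len(stream_data) % 16:
        raise ValueError("not a plausible AES-CBC payload")
    return stream_data[:16], stream_data[16:]
```

The /Length 3424 of the stream in the question is a multiple of 16, which is consistent with an IV followed by whole AES blocks. After decrypting the ciphertext with the object key and IV and removing the block padding, the result is what gets passed to the Flate decoder.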
Upvotes: 10
Reputation: 87
You have <<...>>stream(blah blah)endstream
First use zlib to inflate the (blah blah) stream data.
If you use Python 3, it's really simple. Just grab all the data between stream and endstream and pass it to zlib.decompress.
import zlib
results = zlib.decompress(b'(blah blah)')
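If you're using C++ and the zlib library, use a function like this.
#include <string>
#include <vector>
#include <zlib.h>

// Inflate a zlib-wrapped buffer, growing the output buffer until it fits.
// Named inflateString so it does not clash with zlib's own inflate().
int inflateString(const std::string &source, std::string &destination)
{
    int err = Z_BUF_ERROR;
    size_t capacity = source.empty() ? 1 : source.size();
    while (err == Z_BUF_ERROR)
    {
        capacity *= 3;
        std::vector<Bytef> dest(capacity);
        // uncompress() expects a uLongf and overwrites it with the output size.
        uLongf destLen = static_cast<uLongf>(capacity);
        err = uncompress(dest.data(), &destLen,
                         reinterpret_cast<const Bytef *>(source.data()),
                         static_cast<uLong>(source.size()));
        if (err == Z_OK)
        {
            // Only touch the caller's string once decompression has succeeded.
            destination.assign(reinterpret_cast<const char *>(dest.data()), destLen);
        }
    }
    return err;
}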
The inflated content will be a sequence of numbers followed by some PDF objects (usually dictionaries), e.g. "123 0 124 25 <<...>><<...>>"
In this example, 123 is the object number of the first compressed object and 0 is its byte offset, measured from the position given by the First entry, i.e. from the end of the N number pairs.
For more reading, see pages 53 and 54 of the specification: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
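The layout described above can be sketched in Python as follows. The function name is made up for this example, `decoded` is assumed to be the fully decoded stream data (decrypted first if the file is encrypted, then inflated), and the offsets are assumed to be in increasing order.

```python
def parse_object_stream(decoded: bytes, n: int, first: int):
    """Return {object number: raw object bytes} from decoded /ObjStm data."""
    header = decoded[:first].split()
    if len(header) < 2 * n:
        raise ValueError("header has fewer than N number pairs")
    # Each pair is (object number, byte offset relative to /First).
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    objects = {}
    for i, (obj_num, offset) in enumerate(pairs):
        start = first + offset
        # An object ends where the next one starts, or at the end of the data.
        end = first + pairs[i + 1][1] if i + 1 < n else len(decoded)
        objects[obj_num] = decoded[start:end].strip()
    return objects
```

For the stream in the question, the call would be `parse_object_stream(decoded, 79, 721)`, using the N and First values from its dictionary.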
Upvotes: -2
Reputation: 1215
If it crashes, that would indicate a bug in your Flate decoder. I can't examine it, but even if the stream is invalid, your PDF software ideally shouldn't crash.
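As one way to fail gracefully, a decoder can check the two-byte zlib header before inflating and treat a decoding error as "not valid Flate data" rather than letting it bring the program down. The Python sketch below uses helper names invented for this example and the bytes quoted in the question.

```python
import zlib


def looks_like_zlib(data: bytes) -> bool:
    """Check the 2-byte zlib header: deflate method and FCHECK checksum."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # Low nibble of CMF must be 8 (deflate); the header must be a multiple of 31.
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


def try_inflate(data: bytes):
    """Return the decompressed bytes, or None if the data is not valid zlib."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None


question_bytes = bytes([101, 75, 111, 113, 22, 229, 156, 253, 116])
print(looks_like_zlib(question_bytes))  # False
```

The first byte, 101 (0x65), has a low nibble of 5 instead of 8, so these bytes cannot be the start of a zlib stream, which suggests that something else, such as encryption, has been applied on top of the compressed data.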
Upvotes: -2