Reputation: 5
I have already been successfully extracting images from PDFs for a few years. I use iTextSharp to do this. I get a PdfImageObject and read its filter. Mostly this filter is "/FlateDecode". In that case, I use pdf.PdfReader.FlateDecode(bytes, True) to decode the raw bytes.
But recently I have been confronted with PDFs containing PdfImageObjects with the filter "[/FlateDecode, /RunLengthDecode]".
So I guess that the raw bytes must be decoded twice!?!?
I found some code on the internet for the /RunLengthDecode part: https://github.com/kusl/itextsharp/blob/master/tags/iTextSharp_5_4_5/src/core/iTextSharp/text/pdf/FilterHandlers.cs
I tried to apply both decode steps to the image data: first /FlateDecode and then /RunLengthDecode, and also the other way around, /RunLengthDecode first and then /FlateDecode.
But the /RunLengthDecode code gives me an error in both scenarios.
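To make the order of the first attempt explicit, here is roughly what I do (a minimal sketch shown in C#, assuming iTextSharp 5.x; ChainedDecodeSketch, rawBytes and runLengthDecode are my own placeholder names, and runLengthDecode stands for the decoder I copied from the linked FilterHandlers.cs):

using System;
using iTextSharp.text.pdf;

public static class ChainedDecodeSketch {
    // The /Filter array lists the filters in the order they are applied when decoding,
    // so [/FlateDecode, /RunLengthDecode] means: inflate first, run-length decode second.
    public static byte[] DecodeFlateThenRunLength(byte[] rawBytes, Func<byte[], byte[]> runLengthDecode) {
        byte[] flateDecoded = PdfReader.FlateDecode(rawBytes, true);
        return runLengthDecode(flateDecoded);
    }
}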
Upvotes: 0
Views: 3035
Reputation: 95918
This actually is not an answer to the question as is but an analysis of the problem that led to this question.
In comments to the question it turned out that a bug in iText is the reason why the OP tries to manually filter raw streams and extract images: certain images were extracted with small errors. The OP identified the problematic images to be those with the filters [/FlateDecode, /RunLengthDecode].
The bug in question indeed is in iText's implementation of the RunLengthDecode filter, here quoted from iText 5.5.x for .Net:
private class Filter_RUNLENGTHDECODE : IFilterHandler {
    public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary) {
        // allocate the output buffer
        MemoryStream baos = new MemoryStream();
        sbyte dupCount = -1;
        for (int i = 0; i < b.Length; i++){
            dupCount = (sbyte)b[i];
            if (dupCount == -128) break; // this is implicit end of data
            if (dupCount >= 0 && dupCount <= 127){
                int bytesToCopy = dupCount+1;
                baos.Write(b, i, bytesToCopy);
                i+=bytesToCopy;
            } else {
                // make dupcount copies of the next byte
                i++;
                for (int j = 0; j < 1-(int)(dupCount);j++){
                    baos.WriteByte(b[i]);
                }
            }
        }
        return baos.ToArray();
    }
}
More exactly, it is this line:
baos.Write(b, i, bytesToCopy);
It should have copied the next bytesToCopy bytes after index i (at index i there is the count value, after all), but this command copies the next bytesToCopy bytes starting at index i. Thus, for every run of bytes to copy once, iText instead first copies the count byte and then all but the final byte of the run.
Instead, the line should be
baos.Write(b, i+1, bytesToCopy);
As runs of duplicate bytes are extracted correctly, and even long non-duplicate runs contain many correct bytes (merely at off-by-one positions), the images iText extracted look only slightly wrong, with small errors, e.g.:
Damaged image:
Undamaged image:
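To make the off-by-one concrete, here is a small, self-contained sketch; the sample bytes are my own (a single literal run "ABC" followed by the end-of-data marker 0x80), not taken from the OP's file. It runs the decode loop quoted above once with iText's copy starting at index i and once with the corrected copy starting at index i + 1:

using System;
using System.IO;

public static class RunLengthBugDemo {
    // Decode per the RunLengthDecode scheme quoted above; copyOffset is 0 for the buggy
    // iText line (baos.Write(b, i, ...)) and 1 for the corrected one (baos.Write(b, i+1, ...)).
    static byte[] Decode(byte[] b, int copyOffset) {
        MemoryStream baos = new MemoryStream();
        for (int i = 0; i < b.Length; i++) {
            sbyte dupCount = (sbyte)b[i];
            if (dupCount == -128) break; // implicit end of data
            if (dupCount >= 0) {
                int bytesToCopy = dupCount + 1; // literal run of bytesToCopy bytes
                baos.Write(b, i + copyOffset, bytesToCopy);
                i += bytesToCopy;
            } else {
                i++; // run of 1 - dupCount duplicates of the next byte
                for (int j = 0; j < 1 - dupCount; j++) baos.WriteByte(b[i]);
            }
        }
        return baos.ToArray();
    }

    public static void Main() {
        byte[] encoded = { 0x02, (byte)'A', (byte)'B', (byte)'C', 0x80 };
        Console.WriteLine(BitConverter.ToString(Decode(encoded, 0))); // buggy copy:   02-41-42
        Console.WriteLine(BitConverter.ToString(Decode(encoded, 1))); // correct copy: 41-42-43
    }
}

The buggy variant yields 02-41-42, i.e. the count byte plus all but the last byte of the run; the corrected variant yields 41-42-43, i.e. "ABC".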
This bug has been in iText 5.x for .Net for many years. It has also been present in iText 5.x for Java for many years and still is, e.g. here in the current 5.5.13-SNAPSHOT:
private static class Filter_RUNLENGTHDECODE implements FilterHandler{
    public byte[] decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary) throws IOException {
        // allocate the output buffer
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte dupCount = -1;
        for(int i = 0; i < b.length; i++){
            dupCount = b[i];
            if (dupCount == -128) break; // this is implicit end of data
            if (dupCount >= 0 && dupCount <= 127){
                int bytesToCopy = dupCount+1;
                baos.write(b, i, bytesToCopy);
                i+=bytesToCopy;
            } else {
                // make dupcount copies of the next byte
                i++;
                for(int j = 0; j < 1-(int)(dupCount);j++){
                    baos.write(b[i]);
                }
            }
        }
        return baos.toByteArray();
    }
}
and in iText 7, e.g. here from the current 7.1.2-SNAPSHOT for Java:
public class RunLengthDecodeFilter implements IFilterHandler {
    @Override
    public byte[] decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte dupCount;
        for (int i = 0; i < b.length; i++) {
            dupCount = b[i];
            if (dupCount == (byte) 0x80) { // this is implicit end of data
                break;
            }
            if (dupCount >= 0) {
                int bytesToCopy = dupCount + 1;
                baos.write(b, i, bytesToCopy);
                i += bytesToCopy;
            } else { // make dupcount copies of the next byte
                i++;
                for (int j = 0; j < 1 - (int) (dupCount); j++) {
                    baos.write(b[i]);
                }
            }
        }
        return baos.toByteArray();
    }
}
Most likely this bug could remain unnoticed for so long because the RunLengthDecode filter has hardly ever been used for a number of years.
Upvotes: 3