Abhishek Dadhich

Reputation: 321

How to revert an incremental update in a PDF using PDFBox?

How can we revert the last incremental update made to a PDF using PDFBox?

For example: Original document, Signed document

When I digitally sign (certification signature) an original document using an incremental save, I get a signed document. Inspecting the source of the signed document, I can see that "%%EOF" appears twice. If I manually remove the last "%%EOF" along with the content that precedes it (everything after the first "%%EOF"), the PDF returns to its initial state, which is very similar to the original document.

How can I do this programmatically?

I am using PDFBox v2.0.8

Best Regards, Abhishek

Upvotes: 3

Views: 1103

Answers (1)

mkl

Reputation: 96064

There are more advanced approaches and there are less advanced ones.

This is the simplest one: it searches for the %%EOF markers and cuts the file off right after the second-to-last one. The result might not be byte-identical to the original previous revision, because that marker may be followed by an optional end-of-line marker. Unless that previous revision is signed or linearized, though, the variant with the end-of-line marker and the one without are equivalent as PDF files.

For searching the %%EOF markers we use the StreamSearcher class from the twitter/elephant-bird project, cf. this earlier Stack Overflow answer:

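// Returns, for every %%EOF marker in the stream, the offset just past that marker.
public List<Long> simpleApproach(InputStream pdf) throws IOException {
    StreamSearcher streamSearcher = new StreamSearcher("%%EOF".getBytes());
    List<Long> results = new ArrayList<>();
    long revisionSize = 0;
    long diff;
    // search() reports the bytes read up to the end of the next match, or -1 if there is none
    while ((diff = streamSearcher.search(pdf)) > -1) {
        revisionSize += diff;
        results.add(revisionSize);
    }
    return results;
}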
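If pulling in elephant-bird is not an option, the same search can be sketched with the JDK alone. The class below is a hypothetical stand-in (the names EofOffsets and find are not from any library): it reads the stream byte by byte and uses a small Knuth-Morris-Pratt matcher to report the offset just past each %%EOF marker, which is the same kind of value simpleApproach collects. Like the approach above, it does not parse the PDF in any way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical JDK-only replacement for the StreamSearcher-based search.
final class EofOffsets {
    private static final byte[] MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset just past each %%EOF marker found in the stream. */
    static List<Long> find(InputStream pdf) throws IOException {
        // KMP failure table: length of the longest proper prefix of MARKER
        // that is also a suffix of MARKER[0..i].
        int[] failure = new int[MARKER.length];
        for (int i = 1, k = 0; i < MARKER.length; i++) {
            while (k > 0 && MARKER[i] != MARKER[k]) {
                k = failure[k - 1];
            }
            if (MARKER[i] == MARKER[k]) {
                k++;
            }
            failure[i] = k;
        }

        List<Long> ends = new ArrayList<>();
        long position = 0; // number of bytes consumed so far
        int matched = 0;   // number of marker bytes currently matched
        int b;
        // Reads single bytes; wrap the source in a BufferedInputStream for large files.
        while ((b = pdf.read()) != -1) {
            position++;
            while (matched > 0 && (byte) b != MARKER[matched]) {
                matched = failure[matched - 1];
            }
            if ((byte) b == MARKER[matched]) {
                matched++;
            }
            if (matched == MARKER.length) {
                ends.add(position);
                matched = 0;
            }
        }
        return ends;
    }
}
```

The returned list can be used in place of the result of simpleApproach; its second-to-last entry is the size of the previous revision without a trailing end-of-line marker.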

For copying only the desired number of bytes, we use the Guava ByteStreams class. (There are many alternatives, e.g. Apache Commons IO, but Guava happened to already be in my test project dependencies.)

List<Long> simpleSizes = null;
try (   InputStream resource = GET_DOCUMENT_INPUTSTREAM) {
    simpleSizes = simpleApproach(resource);
}

if (1 < simpleSizes.size()) {
    try (   InputStream resource = GET_DOCUMENT_INPUTSTREAM;
            OutputStream file = new FileOutputStream("previousRevision.pdf")) {
        InputStream revision = ByteStreams.limit(resource, simpleSizes.get(simpleSizes.size() - 2));
        ByteStreams.copy(revision, file);
    }
}

GET_DOCUMENT_INPUTSTREAM might be a new FileInputStream(PDF_PATH) or new ByteArrayInputStream(PDF_BYTES) or whatever means you have to repeatedly retrieve an InputStream for the PDF. With a ByteArrayInputStream you can even re-use the same stream by calling reset(); a plain FileInputStream does not support mark/reset, so open a new one for the second pass.
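If Guava is not available either, the limited copy can be sketched with plain java.io. The helper below is a hypothetical illustration (PrefixCopy and copy are made-up names): it copies at most a given number of bytes from an InputStream to an OutputStream, which is what the combination of ByteStreams.limit and ByteStreams.copy is used for above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical JDK-only helper for copying just the leading bytes of a stream.
final class PrefixCopy {
    /**
     * Copies at most limit bytes from in to out and returns the number of
     * bytes actually copied (less than limit if the input ends early).
     */
    static long copy(InputStream in, OutputStream out, long limit) throws IOException {
        byte[] buffer = new byte[8192];
        long remaining = limit;
        while (remaining > 0) {
            // Never request more than what is still allowed.
            int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (read == -1) {
                break; // input exhausted before the limit was reached
            }
            out.write(buffer, 0, read);
            remaining -= read;
        }
        return limit - remaining;
    }
}
```

With the sizes from the search step, a call such as PrefixCopy.copy(resource, file, simpleSizes.get(simpleSizes.size() - 2)) would write the previous revision to the output stream.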

Upvotes: 3
