Reputation: 321
How can we revert the last incremental update made to a PDF using PDFBox?
For example: original document, signed document.
When I digitally sign (certification signature) an original document using an incremental save, I get a signed document. Inspecting the source of the signed document, I can see that "%%EOF" appears twice. If I manually remove the last "%%EOF" along with the content that precedes it (back to the first "%%EOF"), the PDF returns to its initial state, which is very similar to the original document.
How can I do this programmatically?
I am using PDFBox 2.0.8.
Best Regards, Abhishek
Upvotes: 3
Views: 1103
Reputation: 96064
There are more advanced approaches and there are less advanced ones.
This is the simplest one: it searches for the %%EOF marker and cuts off right after it. The result might not be identical to the original previous revision because that marker may be followed by an optional end-of-line marker. Unless that previous revision is signed or linearized, though, the variant with the end-of-line marker and the one without are equivalent as PDF files.
For searching the %%EOF marker we use the StreamSearcher class from the twitter/elephant-bird project, cf. this earlier Stack Overflow answer:
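    // Returns, for every %%EOF marker found, the byte offset just past that marker,
    // i.e. the size of the revision that ends there.
    public List<Long> simpleApproach(InputStream pdf) throws IOException {
        StreamSearcher streamSearcher = new StreamSearcher("%%EOF".getBytes());
        List<Long> results = new ArrayList<>();
        long revisionSize = 0;
        long diff;
        // search() returns the number of bytes consumed up to the end of the match, or -1
        while ((diff = streamSearcher.search(pdf)) > -1) {
            revisionSize += diff;
            results.add(revisionSize);
        }
        return results;
    }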
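If pulling in elephant-bird just for StreamSearcher is not an option, the same offsets can be computed with a plain scan over a byte array. The following is only a minimal sketch under the assumption that the whole PDF fits into memory; the class name EofOffsets and the method findRevisionEnds are made up for this illustration. Like the stream-based version, it also counts a %%EOF that happens to occur inside a stream or string object, so it is no more robust than the simple approach above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or elephant-bird.
final class EofOffsets {
    private static final byte[] MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns, for each "%%EOF" occurrence, the offset just past the marker. */
    static List<Long> findRevisionEnds(byte[] pdf) {
        List<Long> ends = new ArrayList<>();
        for (int i = 0; i + MARKER.length <= pdf.length; i++) {
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if (pdf[i + j] != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                ends.add((long) (i + MARKER.length));
            }
        }
        return ends;
    }
}
```

The second-to-last entry of the returned list plays the same role as simpleSizes.get(simpleSizes.size() - 2) in the answer's code.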
For copying only the desired number of bytes, we use the Guava ByteStreams class. (There are many alternatives, e.g. Apache Commons IO, but Guava happened to already be in my test project dependencies.)
    List<Long> simpleSizes = null;
    try (InputStream resource = GET_DOCUMENT_INPUTSTREAM) {
        simpleSizes = simpleApproach(resource);
    }
    // Only cut if there is at least one revision before the last one
    if (1 < simpleSizes.size()) {
        try (InputStream resource = GET_DOCUMENT_INPUTSTREAM;
             OutputStream file = new FileOutputStream("previousRevision.pdf")) {
            // The second-to-last offset is the size of the previous revision
            InputStream revision = ByteStreams.limit(resource, simpleSizes.get(simpleSizes.size() - 2));
            ByteStreams.copy(revision, file);
        }
    }
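If neither Guava nor Commons IO is available, the limited copy can be done with the JDK alone. This is a rough sketch, not the code from the answer; LimitedCopy and copyFirstBytes are names invented here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical JDK-only stand-in for ByteStreams.limit(...) plus ByteStreams.copy(...).
final class LimitedCopy {
    /** Copies at most 'limit' bytes from in to out and returns the number of bytes copied. */
    static long copyFirstBytes(InputStream in, OutputStream out, long limit) throws IOException {
        byte[] buffer = new byte[8192];
        long remaining = limit;
        while (remaining > 0) {
            int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (read < 0) {
                break; // stream ended before the limit was reached
            }
            out.write(buffer, 0, read);
            remaining -= read;
        }
        return limit - remaining;
    }
}
```

In the snippet above, the two ByteStreams lines would then become a single call such as LimitedCopy.copyFirstBytes(resource, file, simpleSizes.get(simpleSizes.size() - 2)).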
GET_DOCUMENT_INPUTSTREAM might be a new FileInputStream(PDF_PATH) or new ByteArrayInputStream(PDF_BYTES) or whatever means you have to repeatedly retrieve an InputStream for the PDF. With a ByteArrayInputStream you can even re-use the same stream using reset(). A plain FileInputStream does not support mark/reset, so it has to be re-opened (or wrapped in a stream that does support it).
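As mentioned at the start, the %%EOF marker may be followed by an optional end-of-line marker, and the simple approach cuts before it. If you would rather keep that end-of-line marker when it is present, a small adjustment of the cut position could look like the sketch below. It assumes the PDF is available as a byte array and that the end-of-line marker is CR, LF, or CR LF; EolAwareCut and includeEol are hypothetical names. Whether the original revision really ended with that end-of-line marker cannot be told from the bytes alone, so treat this as a heuristic.

```java
// Hypothetical helper: extends a cut position to include an EOL directly after %%EOF.
final class EolAwareCut {
    /**
     * @param pdf         the complete file contents
     * @param endOfMarker offset just past a "%%EOF" marker
     * @return the offset just past an optional CR, LF, or CR LF following the marker
     */
    static int includeEol(byte[] pdf, int endOfMarker) {
        int end = endOfMarker;
        if (end < pdf.length && pdf[end] == '\r') {
            end++;
        }
        if (end < pdf.length && pdf[end] == '\n') {
            end++;
        }
        return end;
    }
}
```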
Upvotes: 3