When changing a PDF and then removing the change, the hashes of the restored file and the original file are different

Question

If I access a PDF to add something to a custom property, using this code:

    File src_2 = new File(embed_source);
    File dest_2 = new File(embed_destination_2);

    try {
        FileUtils.copyFile(src_2, dest_2);
    } catch (IOException e) {
        e.printStackTrace();
    }

    public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
        PdfReader reader = new PdfReader(src);
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
        Map info = reader.getInfo();
        System.out.println(info.get("Lala"));

        stamper.setMoreInfo((HashMap) info);
        stamper.close();
        reader.close();
    }

I did not change anything in the src file; all I did was read some information from it. However, I get two different hash results for the src file before and after I run the program. May I know why?

Bruno Lowagie · Accepted Answer

If you read ISO-32000-1, you should know that no two PDFs are equal by design. One of the most typical differences between two PDFs is the ID:

From ISO-32000-1:

ID: An array of two byte-strings constituting a file identifier.

From Section 14.4, entitled "file identifiers":

The value of this entry shall be an array of two byte strings. The first byte string shall be a permanent identifier based on the contents of the file at the time it was originally created and shall not change when the file is incrementally updated. The second byte string shall be a changing identifier based on the file’s contents at the time it was last updated. When a file is first written, both identifiers shall be set to the same value. If both identifiers match when a file reference is resolved, it is very likely that the correct and unchanged file has been found. If only the first identifier matches, a different version of the correct file has been found.

If you create a PDF from scratch, the ID consists of two identical identifiers. When you update the PDF to add something, the first ID is preserved, the second ID is changed. If you update the PDF to remove that something, that second ID is again changed, but by definition, it should not be identical to the first ID, because you are at a different part of the workflow.

Note: not many tools create PDFs in which the two identifiers are identical. That's because a PDF that is created from scratch is usually manipulated before the final version is saved to disk. Just create a PDF using Adobe Acrobat to reproduce this: you'll notice that the identifier pair consists of two different values. This makes it pointless to ask whether we can create a situation where the second identifier is identical to the first one.
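To see which situation a given file is in, you can look at the ID entry directly. What follows is only a rough sketch in plain Java, not iText API: the class name FileIdSketch and its methods are made up for illustration. It assumes that both identifiers are written as hex strings (<...>) and that the dictionary holding the ID entry can be read as plain text. A real PDF parser is more reliable.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText.
final class FileIdSketch {
    // Assumption: the ID array holds two hex strings, e.g. /ID [<AB12> <CD34>].
    private static final Pattern ID_ARRAY = Pattern.compile(
            "/ID\\s*\\[\\s*<([0-9A-Fa-f\\s]*)>\\s*<([0-9A-Fa-f\\s]*)>\\s*\\]");

    /** Returns the two identifiers of the last /ID entry found, or null if there is none. */
    static String[] lastFileId(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data survives the conversion.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Matcher m = ID_ARRAY.matcher(text);
        String[] ids = null;
        // In an incrementally updated file the newest trailer comes last, so keep the last match.
        while (m.find()) {
            ids = new String[] { normalize(m.group(1)), normalize(m.group(2)) };
        }
        return ids;
    }

    // Hex strings may contain whitespace and mixed case; strip and upper-case them for comparison.
    private static String normalize(String hex) {
        return hex.replaceAll("\\s", "").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        byte[] sample = "trailer\n<< /Size 5 /ID [<ABCD> <12EF>] >>\n%%EOF"
                .getBytes(StandardCharsets.ISO_8859_1);
        String[] ids = lastFileId(sample);
        System.out.println("permanent: " + ids[0] + ", changing: " + ids[1]
                + ", identical: " + ids[0].equals(ids[1]));
    }
}
```

If the first identifier of two files matches but the second does not, you are looking at two versions of the same document, which is exactly what the specification describes.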

Moreover: it is inherent to PDF that the way objects are organized is random. Your use case using hashes goes against the PDF standard.

How to solve this problem?

You are the same person who asked the question [how to] Add / delete / retrieve information from a PDF using a custom property

In my answer to this question, I explain how to add metadata to an existing PDF:

PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));

This creates a new PDF file in which objects are being reordered.

However, you can change this line into:

PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest), '\0', true);

Now you are creating an incremental update of your PDF file.

What is an incremental update?

Suppose that your original PDF file looks like this:

%PDF-1.4
% plenty of PDF objects and PDF syntax
%%EOF

When you use iText to manipulate such a file, you get an altered PDF file:

%PDF-1.4
% plenty of altered PDF objects and altered PDF syntax
%%EOF

During this process, objects can be renumbered, reorganized, and so on. If you add something in a first pass and remove it in a second pass, you can expect the PDF to look the same to the human eye when the document is opened in a PDF viewer, but you should not expect the PDF syntax to be identical. That assumption would reveal a total lack of insight into the PDF format.

However, when you use PdfStamper in append mode to perform an incremental update, you get an incrementally updated PDF:

%PDF-1.4
% plenty of PDF objects and PDF syntax
%%EOF
% updates for PDF objects and PDF syntax
%%EOF

In this case, the bytes of the original PDF aren't changed. The file gets bigger because it now contains some redundant information (some objects are no longer used, and for other objects there is an old version alongside a new one), but the advantage of an incremental update is that you can always go back to the original file.
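A quick way to convince yourself that an update really was incremental is to check that the new file starts with every byte of the old one. This is a minimal sketch in plain Java; AppendModeCheck and preservesOriginalBytes are hypothetical names, not iText API, and the byte arrays are assumed to have been read from disk beforehand (for instance with Files.readAllBytes).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of iText.
final class AppendModeCheck {
    /** True if 'updated' is strictly longer than 'original' and begins with all of its bytes. */
    static boolean preservesOriginalBytes(byte[] original, byte[] updated) {
        if (updated.length <= original.length) {
            return false; // an incremental update can only add bytes
        }
        // Compare the leading part of the updated file with the complete original.
        return Arrays.equals(Arrays.copyOf(updated, original.length), original);
    }

    public static void main(String[] args) {
        byte[] original = "%PDF-1.4\n% objects\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] appended = "%PDF-1.4\n% objects\n%%EOF\n% updates\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] rewritten = "%PDF-1.4\n% altered objects\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("append mode keeps original bytes: " + preservesOriginalBytes(original, appended));
        System.out.println("full rewrite keeps original bytes: " + preservesOriginalBytes(original, rewritten));
    }
}
```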

It's sufficient to search for the second-to-last occurrence of %%EOF and remove all the bytes that follow it. You then get a truncated PDF file:

%PDF-1.4
% plenty of PDF objects and PDF syntax
%%EOF

You can now take a hash of this truncated PDF file and compare it with the hash of the original PDF file. These hashes will be identical.
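The truncate-and-compare step could be sketched as follows in plain Java. RevisionHash, dropLastRevision and sha256Hex are hypothetical names chosen for this example, SHA-256 is an arbitrary choice of hash, and the code assumes that the byte sequence %%EOF only occurs as an end-of-file marker (not by coincidence inside a binary stream) and that the file holds exactly one incremental update on top of the original.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, not part of iText.
final class RevisionHash {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Index of the last occurrence of the marker starting at or before 'from', or -1. */
    static int lastIndexOf(byte[] data, byte[] marker, int from) {
        for (int i = Math.min(from, data.length - marker.length); i >= 0; i--) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) { match = false; break; }
            }
            if (match) return i;
        }
        return -1;
    }

    /** Cuts the file right after its second-to-last %%EOF marker. */
    static byte[] dropLastRevision(byte[] pdf) {
        int last = lastIndexOf(pdf, EOF_MARKER, pdf.length);
        int previous = last < 0 ? -1 : lastIndexOf(pdf, EOF_MARKER, last - 1);
        if (previous < 0) {
            throw new IllegalArgumentException("fewer than two %%EOF markers found");
        }
        return Arrays.copyOf(pdf, previous + EOF_MARKER.length);
    }

    /** Lower-case hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(data)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    // Usage: java RevisionHash original.pdf updated.pdf
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("usage: RevisionHash <original.pdf> <updated.pdf>");
            return;
        }
        byte[] original = Files.readAllBytes(Paths.get(args[0]));
        byte[] restored = dropLastRevision(Files.readAllBytes(Paths.get(args[1])));
        System.out.println("hashes match: " + sha256Hex(original).equals(sha256Hex(restored)));
    }
}
```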

Caveat: beware of the whitespace characters that follow %%EOF. They can cause a minimal difference at the byte level that causes the hashes to be different.
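One possible way to deal with this caveat is to normalize both files before hashing, so that anything after the final %%EOF is ignored when it is only whitespace. The sketch below is an assumption about how you might do that, not something iText provides; EofNormalizer and trimAfterFinalEof are hypothetical names. Note that the hash you compare is then the hash of the normalized bytes, not of the file as stored on disk.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of iText.
final class EofNormalizer {
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    // The six white-space characters defined by the PDF syntax: NUL, TAB, LF, FF, CR, SPACE.
    private static boolean isPdfWhitespace(byte b) {
        return b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /**
     * Drops trailing whitespace, but only if what precedes it is a %%EOF marker.
     * Any other input is returned as an unchanged copy.
     */
    static byte[] trimAfterFinalEof(byte[] pdf) {
        int end = pdf.length;
        while (end > 0 && isPdfWhitespace(pdf[end - 1])) {
            end--;
        }
        if (end < EOF_MARKER.length) {
            return pdf.clone();
        }
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (pdf[end - EOF_MARKER.length + j] != EOF_MARKER[j]) {
                return pdf.clone();
            }
        }
        return Arrays.copyOf(pdf, end);
    }

    public static void main(String[] args) {
        byte[] unix = "%PDF-1.4\n% objects\n%%EOF\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] bare = "%PDF-1.4\n% objects\n%%EOF".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("equal after normalizing: "
                + Arrays.equals(trimAfterFinalEof(unix), trimAfterFinalEof(bare)));
    }
}
```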
