fuyun shenma

Reputation: 1

How to get the original PDF contents using iText

To make the problem concrete: I currently have three PDFs.

The first PDF is a plain PDF without any signature: https://drive.google.com/file/d/14gPZaL2AClRlPb5R2FQob4BBw31vvqYk/view?usp=sharing

The second PDF is the first PDF, digitally signed with Adobe Acrobat DC: https://drive.google.com/file/d/1CSrWV7SKrWUAJAf2uhwRZ8ephGa_uYYs/view?usp=sharing

The third PDF was generated from the second one, using the code you once provided:

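        import com.itextpdf.io.source.RASInputStream;
        import com.itextpdf.io.source.RandomAccessFileOrArray;
        import com.itextpdf.io.source.RandomAccessSourceFactory;
        import com.itextpdf.kernel.pdf.PdfArray;
        import com.itextpdf.signatures.PdfSignature;
        import com.itextpdf.signatures.SignatureUtil;
        import java.io.ByteArrayOutputStream;
        import java.io.FileInputStream;
        import java.io.InputStream;
        import javax.xml.bind.DatatypeConverter;

        // Open the signed PDF (test2.pdf).
        com.itextpdf.kernel.pdf.PdfReader pdfReader = new com.itextpdf.kernel.pdf.PdfReader(
                new FileInputStream("C:\\Users\\Dell\\Desktop\\test2.pdf"));
        com.itextpdf.kernel.pdf.PdfDocument pdfDocument = new com.itextpdf.kernel.pdf.PdfDocument(pdfReader);
        SignatureUtil signatureUtil = new SignatureUtil(pdfDocument);
        for (String name : signatureUtil.getSignatureNames()) {
            System.out.println(name);
            PdfSignature signature = signatureUtil.getSignature(name);
            // ByteRange holds [offset1, length1, offset2, length2] of the signed bytes.
            PdfArray b = signature.getByteRange();
            long[] longs = b.asLongArray();
            RandomAccessFileOrArray rf = pdfReader.getSafeFile();
            // Read only the signed byte ranges, skipping the signature container between them.
            try (InputStream rg = new RASInputStream(new RandomAccessSourceFactory().createRanged(rf.createSourceView(), longs));
                 ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream()) {
                byte[] buf = new byte[8192];
                int rd;
                while ((rd = rg.read(buf, 0, buf.length)) > 0) {
                    byteArrayOutputStream.write(buf, 0, rd);
                }
                byte[] bytes1 = byteArrayOutputStream.toByteArray();

                // Base64 form of the signed bytes.
                String s2 = DatatypeConverter.printBase64Binary(bytes1);
            }
        }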

Processing the second PDF with this code yields the Base64-encoded form of the third PDF. The third PDF is here: https://drive.google.com/file/d/1LSbZpaVT9GrfotXplmKWl6HaCvxmaoH9/view?usp=sharing

My question is: is there a method that takes the first PDF as input and produces the third PDF as output?

Upvotes: 0

Views: 219

Answers (1)

mkl

Reputation: 95918

If I understand you correctly, you start with an unsigned PDF document test1.pdf. You sign it using Adobe Acrobat and get a signed PDF document test2.pdf. Then you apply your code to that signed PDF and get a file test3.pdf.

And now you wonder whether you can get test3.pdf immediately from test1.pdf some other way, independent from the specific signing step done in Adobe Acrobat.

This is not possible in practice.

Signing a PDF does not merely append a few signature-related attributes; it can completely re-organize the PDF internally!

For example, your original test1.pdf is a normally saved PDF with cross reference tables. Adobe Acrobat saved the signed document as a linearized PDF with object streams and cross reference streams. Also all the PDF objects are renumbered. This causes a byte-wise comparison of test1.pdf and test2.pdf to hardly find any similarities.

All these changes are not necessary for signing but merely represent Acrobat's preferred way of saving a hitherto unsigned PDF. Thus, after the next program update Acrobat may or may not change this behavior completely without prior notice.

But even if Acrobat only saved necessary changes (whenever it saves as an incremental update, it forgoes most unnecessary changes), there would still be multiple valid ways to format them.
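For illustration, whether a signer saved as an incremental update can be checked without any PDF library: an incremental update only appends to the file, so the signed file must begin with every byte of the unsigned one. A minimal sketch using only the JDK; the class and method names are hypothetical, and the two file paths are expected as command-line arguments:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class PrefixCheck {
    /** Number of leading bytes that are identical in both arrays. */
    static int commonPrefixLength(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    /** True if 'signed' starts with all bytes of 'original', as an incremental update would. */
    static boolean isAppendOnly(byte[] original, byte[] signed) {
        return signed.length >= original.length
                && commonPrefixLength(original, signed) == original.length;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java PrefixCheck.java <unsigned.pdf> <signed.pdf>");
            return;
        }
        byte[] original = Files.readAllBytes(Paths.get(args[0]));
        byte[] signed = Files.readAllBytes(Paths.get(args[1]));
        System.out.println("common prefix length: " + commonPrefixLength(original, signed));
        System.out.println("append-only: " + isAppendOnly(original, signed));
    }
}
```

Given the re-organization described above, test1.pdf and test2.pdf should not pass this check.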

Additionally there are multiple date and version information pieces. E.g. signing, creation, and modification time; also the signature in test2.pdf claims to have been created by Adobe Acrobat Pro DC version 2018.011.20038. A small change in the software used or in the timing of the use will create different information in the result file.

And as the output of your code, your third file, contains everything of test2.pdf except the embedded signature container, all the changes mentioned above are also in your third file.
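In terms of plain byte arithmetic, what the posted code produces is the concatenation of the two ranges listed in the signature's ByteRange array [offset1, length1, offset2, length2]. A library-free sketch with hypothetical names, using a made-up miniature "file" in place of a real PDF:

```java
import java.nio.charset.StandardCharsets;

class SignedBytesDemo {
    /** Concatenates the two signed ranges, leaving out everything between them. */
    static byte[] signedBytes(byte[] file, long[] byteRange) {
        int len1 = (int) byteRange[1];
        int len2 = (int) byteRange[3];
        byte[] out = new byte[len1 + len2];
        System.arraycopy(file, (int) byteRange[0], out, 0, len1);
        System.arraycopy(file, (int) byteRange[2], out, len1, len2);
        return out;
    }

    public static void main(String[] args) {
        // "<sig>" stands in for the signature container; the values are illustrative only.
        byte[] file = "AAAA<sig>BBBB".getBytes(StandardCharsets.US_ASCII);
        long[] byteRange = {0, 4, 9, 4};
        System.out.println(new String(signedBytes(file, byteRange), StandardCharsets.US_ASCII));
    }
}
```

Everything outside the gap survives unchanged, so whatever the signing program wrote into the two ranges is carried over as-is.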


Concerning the terms you use:

You call the output of the code you posted original content or original text (in your previous question here). This is a bit of a misnomer because that output does contain all the changes introduced by the signing program, in your example all the re-organization of the objects in the PDF by Adobe Acrobat, so it is not really original. This output is merely the signed bytes, i.e. the signed byte ranges, of the signed PDF.

Furthermore, you call that output a PDF. Strictly speaking it is not a PDF anymore, at least not a valid one. By removing (the placeholder for) the signature container, the signature dictionary is broken and all offsets in the file after that missing value have shifted.
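The offset shift can be made concrete with a little arithmetic on the ByteRange values. A sketch with hypothetical names and made-up numbers (not taken from test2.pdf):

```java
class OffsetShiftDemo {
    /** Length of the gap between the two signed ranges, i.e. of the removed value. */
    static long gapLength(long[] br) {
        return br[2] - (br[0] + br[1]);
    }

    /** Position in the concatenated signed bytes of a byte found at 'offset' in the signed PDF. */
    static long shiftedOffset(long[] br, long offset) {
        if (offset >= br[0] && offset < br[0] + br[1]) {
            return offset - br[0];
        }
        if (offset >= br[2] && offset < br[2] + br[3]) {
            return br[1] + (offset - br[2]);
        }
        throw new IllegalArgumentException("offset lies in the removed gap or outside the file");
    }

    public static void main(String[] args) {
        long[] byteRange = {0, 1000, 9194, 500}; // illustrative values only
        System.out.println("gap length: " + gapLength(byteRange));
        System.out.println("offset 500 -> " + shiftedOffset(byteRange, 500));
        System.out.println("offset 9200 -> " + shiftedOffset(byteRange, 9200));
    }
}
```

With these numbers, an object that a cross reference entry locates at offset 9200 sits at 1006 in the output, while the entry itself still says 9200.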

Upvotes: 1
