Reputation: 530

Comparing the text of two PDF files using PDFBox fails even though both files have the same text

I am using PDFBox as a utility in my Selenium automation for export testing. We compare the actual exported PDF file with the expected one using PDFBox and pass or fail the test accordingly. This has worked smoothly so far. Recently, however, I came across an actual exported file that looks the same as the expected one (as far as the data is concerned), yet the comparison with PDFBox fails.

Expected pdf file

Actual pdf file

Below is the general utility I am using to compare PDF files:

private static void arePDFFilesEqual(File pdfFile1, File pdfFile2) throws IOException
{
    LOG.info("Comparing PDF files ("+pdfFile1+","+pdfFile2+")");
    PDDocument pdf1 = PDDocument.load(pdfFile1);
    PDDocument pdf2 = PDDocument.load(pdfFile2);
    PDPageTree pdf1pages = pdf1.getDocumentCatalog().getPages();
    PDPageTree pdf2pages = pdf2.getDocumentCatalog().getPages();
    try
    {
        if (pdf1pages.getCount() != pdf2pages.getCount())
        {
            String message = "Number of pages in the files ("+pdfFile1+","+pdfFile2+") do not match. pdfFile1 has "+pdf1pages.getCount()+" pages, while pdfFile2 has "+pdf2pages.getCount()+" pages";
            LOG.debug(message);
            throw new TestException(message);
        }
        PDFTextStripper pdfStripper = new PDFTextStripper();
        LOG.debug("pdfStripper is :- " + pdfStripper);
        LOG.debug("pdf1pages.size() is :- " + pdf1pages.getCount());
        for (int i = 0; i < pdf1pages.getCount(); i++)
        {
            pdfStripper.setStartPage(i + 1);
            pdfStripper.setEndPage(i + 1);
            String pdf1PageText = pdfStripper.getText(pdf1);
            String pdf2PageText = pdfStripper.getText(pdf2);
            if (!pdf1PageText.equals(pdf2PageText))
            {
                String message = "Contents of the files ("+pdfFile1+","+pdfFile2+") do not match on Page no: " + (i + 1)+" pdf1PageText is : "+pdf1PageText+" , while pdf2PageText is : "+pdf2PageText;
                LOG.debug(message);
                System.out.println("fff");
                LOG.debug("pdf1PageText is " + pdf1PageText);
                LOG.debug("pdf2PageText is " + pdf2PageText);
                String difference = StringUtils.difference(pdf1PageText, pdf2PageText);
                LOG.debug("difference is "+difference);
                throw new TestException(message+" [[ Difference is ]] "+difference);
            }
        }
        LOG.info("Returning True , as PDF Files ("+pdfFile1+","+pdfFile2+") get matched");
    } finally {
        pdf1.close();
        pdf2.close();
    }
}

Eclipse shows these differences in the console:

https://s3.amazonaws.com/uploads.hipchat.com/95223/845692/9Ex0QW2fFeRqu8s/upload.png

I can see that it is failing because of symbols such as curly braces ({}), the hash sign (#), and the exclamation mark (!), but I don't know how to fix it.

Can anyone please tell me how to fix this?

Upvotes: 3

Answers (2)

Max Vollmer

Reputation: 8598

This is a tough one, since similar or even identical Unicode characters might have different byte representations, depending on the font, the encoding, and other factors during PDF generation.

A possible solution, if you can safely assume that the relevant text pieces are represented by 8-bit characters:

String stripUnicode(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
        if (c <= 0xFF) {
            sb.append(c);
        }
    }
    return sb.toString();
}

...

String pdf1PageText = pdfStripper.getText(pdf1);
String pdf2PageText = pdfStripper.getText(pdf2);
if (!stripUnicode(pdf1PageText).equals(stripUnicode(pdf2PageText)))
...

If you need Unicode support, you need to implement your own custom comparison algorithm that is able to identify similar characters and treat them as equal.
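As a rough illustration of such a comparison, the sketch below treats compatibility variants of a character as equal by normalizing both strings to NFKC with the JDK's `java.text.Normalizer` before comparing. This is only one possible notion of "similar", chosen here as an assumption; the class and method names are hypothetical, and characters without a compatibility mapping (for example private use code points) are left untouched, so they still compare as different.

```java
import java.text.Normalizer;

// Hypothetical helper: compares two strings after Unicode NFKC normalization.
final class NormalizedTextCompare {

    // Folds compatibility variants (e.g. the ligature U+FB01) into their plain equivalents.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // True if both strings are equal once compatibility differences are removed.
    static boolean sameText(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        // Ligature "fi" (U+FB01) followed by "le" versus plain "file".
        System.out.println(sameText("\uFB01le", "file"));  // prints true
        // Two different private use code points are not unified by NFKC.
        System.out.println(sameText("\uF125", "\uF126"));  // prints false
    }
}
```

In the comparison loop, `sameText(pdf1PageText, pdf2PageText)` would take the place of the plain `equals` call.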

Upvotes: 1

mkl

Reputation: 96039

However recently I came across actual exported file , which looks as same as expected one (as far as data is concerned) , however when comparing it with pdfbox , it is failing

That this might happen should not surprise you. After all, your test does not compare the looks of the pages in question but the results of text extraction.

The look of textual data on the pages depends on the drawing instructions for the glyphs in the respective font file (embedded, in the case of your files). The result of text extraction for the same textual data, however, depends on the ToUnicode table or the Encoding value in the PDF font information structures for that font file.

And indeed, while the textual data of the expected and the actual document use the same glyphs of the respective fonts, the ToUnicode tables in the expected and the actual document for one font claim that certain glyphs represent different Unicode code points.
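Differences of this kind are hard to spot in a console because the affected characters often render identically or not at all. A small sketch of a diagnostic helper (the class and method names are hypothetical) that turns extracted text into a list of code points, which can then be logged for both documents:

```java
import java.util.stream.Collectors;

// Hypothetical diagnostic helper: renders a string as a list of Unicode code points.
final class CodePointDump {

    // Formats each code point as U+XXXX, separated by single spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(describe("A\uF125"));  // prints U+0041 U+F125
    }
}
```

Logging `describe(pdf1PageText)` and `describe(pdf2PageText)` next to each other makes it visible which code points differ, even when the glyphs look the same.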

The font in question has these three glyphs:

The ToUnicode map for that font in your expected document contains the mappings

<0000> <0000> <0000>
<0001> <0002> [<F125> <F128> ]

which claim that these three characters correspond to U+0000, U+F125, and U+F128.

The ToUnicode map for that font in your actual document contains the mappings

<0000> <0000> <0000>
<0001> <0002> [<F126> <F129> ]

which claim that these three characters correspond to U+0000, U+F126, and U+F129.
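To make the effect of the two maps concrete, here is a simplified simulation in plain Java. It is not how PDFBox works internally; the class, the `extract` method, and the map constants are hypothetical and only mirror the mappings quoted above. The same glyph codes produce different strings depending on which map is applied.

```java
import java.util.Map;

// Simplified, hypothetical model of text extraction through a ToUnicode-style lookup.
final class ToUnicodeDemo {

    // Glyph code -> Unicode code point, as quoted from the expected document.
    static final Map<Integer, Integer> EXPECTED =
            Map.of(0x0000, 0x0000, 0x0001, 0xF125, 0x0002, 0xF128);

    // Glyph code -> Unicode code point, as quoted from the actual document.
    static final Map<Integer, Integer> ACTUAL =
            Map.of(0x0000, 0x0000, 0x0001, 0xF126, 0x0002, 0xF129);

    // Translates each glyph code into the code point the map claims for it.
    static String extract(int[] glyphCodes, Map<Integer, Integer> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : glyphCodes) {
            sb.appendCodePoint(toUnicode.get(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] glyphs = {0x0001, 0x0002};  // identical glyphs drawn in both documents
        String expectedText = extract(glyphs, EXPECTED);
        String actualText = extract(glyphs, ACTUAL);
        System.out.println(expectedText.equals(actualText));  // prints false
    }
}
```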

Thus, your test has correctly found a difference between the expected and the actual document, so its failure result is correct. You don't have to fix anything; the software producing the actual document has an issue!

(One could argue that the differences are inside Unicode private use areas and don't matter. In that case you'd have to update your test to ignore differences between characters from Unicode private use areas. But you should have been told that before you started creating tests.)
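If you do decide to ignore such differences, one possible sketch is to mask every private use code point with the same placeholder before comparing. The class and method names are hypothetical. Masking instead of deleting is a deliberate choice here: a missing or extra character is still detected, and only the identity of the private use character is ignored.

```java
// Hypothetical helper: compares strings while ignoring which private use character is used.
final class PrivateUseFilter {

    // Replaces every private use code point (general category Co) with U+FFFD.
    static String maskPrivateUse(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .map(cp -> Character.getType(cp) == Character.PRIVATE_USE ? 0xFFFD : cp)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    // True if the strings are equal apart from the choice of private use code points.
    static boolean equalIgnoringPrivateUse(String a, String b) {
        return maskPrivateUse(a).equals(maskPrivateUse(b));
    }

    public static void main(String[] args) {
        System.out.println(equalIgnoringPrivateUse("a\uF125b", "a\uF126b"));  // prints true
        System.out.println(equalIgnoringPrivateUse("a\uF125b", "ab"));        // prints false
    }
}
```

In the test utility, `equalIgnoringPrivateUse(pdf1PageText, pdf2PageText)` would replace the plain `equals` comparison.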

Upvotes: 3
