Matching text parsed from a PDF with PDFBox

Question

This is more of a lesson learned than a question. I was recently struggling to match strings parsed out of a PDF using PDFBox. My solution might be helpful to others.

A list of text lines was obtained from the PDF using PDFBox like this (exception handling omitted for brevity):

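import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper; // PDFBox 2.x; 1.8.x has it in org.apache.pdfbox.util

List<String> lines = new ArrayList<>();
try (PDDocument document = PDDocument.load(f)) { // f: the PDF as a java.io.File
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(document);
    // Split on the stripper's line separator to get one entry per line
    String[] pageText = text.trim().split(pdfStripper.getLineSeparator());
    for (String line : pageText) {
        lines.add(line);
    }
}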

The List now contains all the lines from the file in order.

However, String.contains and String.equals fail on lines that look identical in the logs (e.g. 'EMERA INCORPORATED'). Converting each character to hex made it clear that the space character was the issue:

Line (Parsed from PDF with PDFBox): EMERA INCORPORATED
45 4d 45 52 41 a0 49 4e 43 4f 52 50 4f 52 41 54 45 44

CompanyName (Set In Java): EMERA INCORPORATED
45 4d 45 52 41 20 49 4e 43 4f 52 50 4f 52 41 54 45 44

Note the 'a0' in the PDFBox String where in Java there is the space ('20').
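For anyone who wants to reproduce the comparison, here is a minimal sketch of such a hex dump. The class name HexDump and the helper toHex are made up for illustration, and the PDF line is simulated with a string literal containing U+00A0 rather than read from a real file.

```java
import java.util.StringJoiner;

class HexDump {
    // Renders each UTF-16 code unit of the string as lowercase hex, separated by spaces.
    static String toHex(String s) {
        StringJoiner joiner = new StringJoiner(" ");
        for (char c : s.toCharArray()) {
            joiner.add(Integer.toHexString(c));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String fromPdf = "EMERA\u00A0INCORPORATED"; // stand-in for a line extracted from the PDF
        String companyName = "EMERA INCORPORATED";  // set in Java with a regular space
        System.out.println(toHex(fromPdf));
        System.out.println(toHex(companyName));
        System.out.println(fromPdf.equals(companyName)); // false: a0 vs 20
    }
}
```

Dumping both strings side by side like this makes invisible differences show up immediately, which the log output alone does not.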

The solution was to use a regular expression to identify the line: EMERA\S+INCORPORATED. This gives better control over the matching, so it's not bad. But it was a bit annoying to figure out, because when reviewing the logs the strings being compared looked identical, yet both contains and equals returned false.

My conclusion: use regular expressions to identify text patterns coming out of a PDF (obtained with PDFBox), and be sure to use '\S' where spaces may appear. Maybe this post can save someone some pain. Also, perhaps someone more familiar with PDFBox could provide tips on using the API better if this is user error on my part.
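One caveat worth checking before relying on this: in Java, '\S' means "any non-whitespace character", and Java's default '\s' class does not include U+00A0. So the pattern above matches the no-break space but not a regular one. The sketch below compares it with '\h' (horizontal whitespace, available since Java 8), which covers both. The class name SpacePatterns and the sample strings are illustrative assumptions, not part of the original code.

```java
import java.util.regex.Pattern;

class SpacePatterns {
    // \S+ : one or more characters that are NOT in Java's default \s class.
    static final Pattern NON_WHITESPACE = Pattern.compile("EMERA\\S+INCORPORATED");
    // \h+ : one or more horizontal whitespace characters, including U+0020 and U+00A0.
    static final Pattern HORIZONTAL = Pattern.compile("EMERA\\h+INCORPORATED");

    static boolean matches(Pattern pattern, String line) {
        return pattern.matcher(line).find();
    }

    public static void main(String[] args) {
        String withNbsp = "EMERA\u00A0INCORPORATED"; // as extracted from the PDF
        String withSpace = "EMERA INCORPORATED";     // as typed in Java source

        System.out.println(matches(NON_WHITESPACE, withNbsp));  // true
        System.out.println(matches(NON_WHITESPACE, withSpace)); // false
        System.out.println(matches(HORIZONTAL, withNbsp));      // true
        System.out.println(matches(HORIZONTAL, withSpace));     // true
    }
}
```

If the same code has to handle PDFs that use regular spaces as well, the '\h+' variant is probably the safer choice.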

mkl · Accepted Answer

perhaps someone more familiar with PDFBox could provide tips on using the API better if this is user error on my part

It is not an error in PDFBox API usage. It is not even specific to PDFBox at all. It is more a matter of wrong expectations.

Different kinds of space characters

First of all, there are different kinds of space characters. The most commonly used is of course the Unicode character 'SPACE' (U+0020), but there are others, in particular the Unicode character 'NO-BREAK SPACE' (U+00A0).

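Thus, unless you know that only one particular space character is used in a given text, it is completely normal to match with a regular expression character class instead of a literal ' '. Be aware, though, that in Java '\S' means any non-whitespace character: it matches U+00A0 only because Java's default '\s' does not include the no-break space, and it will not match a regular space. A class such as '\h' (horizontal whitespace, Java 8 and later) or '[ \u00A0]' covers both.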
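If plain equals and contains are preferred over regular expressions, an alternative is to normalise the extracted text first. The following is only a sketch under that assumption: SpaceNormalizer and normalize are hypothetical names, and it treats every character in Unicode category Zs (space separator) as equivalent to U+0020, which may or may not be what a given application wants.

```java
class SpaceNormalizer {
    // Replaces every character of Unicode category Zs (space separator),
    // which includes U+00A0, with a plain U+0020 space.
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean isSpaceSeparator = Character.getType(c) == Character.SPACE_SEPARATOR;
            out.append(isSpaceSeparator ? ' ' : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fromPdf = "EMERA\u00A0INCORPORATED"; // stand-in for an extracted line

        // Java itself distinguishes the two notions of "space":
        System.out.println(Character.isWhitespace('\u00A0')); // false
        System.out.println(Character.isSpaceChar('\u00A0'));  // true

        System.out.println(normalize(fromPdf).equals("EMERA INCORPORATED")); // true
    }
}
```

Normalising once, right after extraction, keeps the later comparisons simple, at the cost of losing the information about which space character the PDF actually used.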

What does PDFBox extract?

In the case at hand, the non-breaking space was not even PDFBox's choice. Instead, it is ingrained in the PDF itself.

When extracting text from a PDF, PDFBox (just like other PDF libraries) uses the information inside the PDF concerning which glyph represents which Unicode character. This information can be given by an Encoding entry or a ToUnicode entry of the respective font declaration in the PDF.

Only if there is a gap between two text chunks (free space created not by drawing a space character but by moving the text insertion point) do PDF text extractors add a space character of their own choosing, usually the regular space.

As PDFBox does use the regular space in the latter case, the issue at hand is an instance of the first case: the PDF itself indicates that the space there is a non-breaking one.
