Reputation: 444
Some of the PDF documents in our system were created by scanning, with OCR text included. However, the OCR was not performed correctly (it mixed up Cyrillic and Latin characters), and although a document looks searchable, that text is completely incorrect and unusable.
When viewing such a PDF document in Adobe Acrobat Reader DC (or Google Chrome), it is displayed correctly, but on a web page that uses PDF.js to render the document, the OCR text shows in front, instead of the scanned graphical representation of the original text.
The idea is to "repair" these documents by removing the OCR text from the PDF document while preserving the scanned graphical representation of the original text.
For that purpose I have used Apache PDFBox 2.0.11 to inspect the contents of the PDF document. The following code snippet prints the entire text contained in the PDF document, and in this case that text is exactly the same as the OCR text:
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// extract and print all text contained in the PDF
PDDocument document = PDDocument.load(new File("D:/input.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(document.getNumberOfPages());
String sText = stripper.getText(document);
System.out.println(sText);
document.close();
Then I used the example class RemoveAllText provided with PDFBox, hoping it would remove the OCR text from the PDF document. Unfortunately, it removed not only the OCR text, but also the graphical representation of the original scanned text. The method that looks for text elements in the PDF document and removes them is shown below:
private static List<Object> createTokensWithoutText(PDContentStream contentStream) throws IOException
{
    PDFStreamParser parser = new PDFStreamParser(contentStream);
    Object token = parser.parseNextToken();
    List<Object> newTokens = new ArrayList<Object>();
    while (token != null)
    {
        if (token instanceof Operator)
        {
            Operator op = (Operator) token;
            if ("TJ".equals(op.getName()) || "Tj".equals(op.getName()) ||
                    "'".equals(op.getName()) || "\"".equals(op.getName()))
            {
                // remove the one argument to this operator
                newTokens.remove(newTokens.size() - 1);
                token = parser.parseNextToken();
                continue;
            }
        }
        newTokens.add(token);
        token = parser.parseNextToken();
    }
    return newTokens;
}
I presume that this method should be changed somehow (to remove just the text and not its graphical representation), but I don't know how to do that.
Here is an example of a PDF document before RemoveAllText, and here is an example of a PDF document after RemoveAllText.
Upvotes: 0
Views: 2199
Reputation: 95938
There is indeed an error in the createTokensWithoutText code you copied from the PDFBox examples. But the reason that example removes all text from your scanned PDF is that the scanner software already removed the letters from the image, created ad-hoc fonts for them, and drew them again as text using these fonts, so the example simply does what it is meant to do.
The error in createTokensWithoutText
While the text-showing operators Tj, ', and TJ indeed have only a single parameter each, " has three:
aw ac string " – Move to the next line and show a text string, using aw as the word spacing and ac as the character spacing (setting the corresponding parameters in the text state). aw and ac shall be numbers expressed in unscaled text space units.
(ISO 32000-1 Table 109 – Text-showing operators)
If there is a " operation in the stream, therefore, createTokensWithoutText only removes the string argument and the operator but leaves the numeric parameters aw and ac in place. This in turn results in an invalid set of arguments for the following instruction in newTokens.
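Based on that analysis, here is a minimal sketch of a corrected loop body, keeping the variable names from the example above and removing as many preceding tokens as the operator has operands:

if (token instanceof Operator)
{
    String opName = ((Operator) token).getName();
    if ("TJ".equals(opName) || "Tj".equals(opName) || "'".equals(opName))
    {
        // these text-showing operators take exactly one operand, the string
        newTokens.remove(newTokens.size() - 1);
        token = parser.parseNextToken();
        continue;
    }
    if ("\"".equals(opName))
    {
        // " takes three operands: aw, ac, and the string
        newTokens.remove(newTokens.size() - 1);
        newTokens.remove(newTokens.size() - 1);
        newTokens.remove(newTokens.size() - 1);
        token = parser.parseNextToken();
        continue;
    }
}
newTokens.add(token);
token = parser.parseNextToken();

Note, though, that even with this fix all the visible text in your documents will vanish, for the reason explained next.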
The OCR software here did not simply add invisible characters in front of or behind the glyphs in the image to provide text extraction capabilities (which is a very common approach). Instead, it actually created ad-hoc fonts from the glyphs in the image, removed the glyphs from the image, and drew them visibly in front of the image.
Thus, the remaining image only contains some dirt the software did not associate with any glyph.
The ad-hoc fonts contain glyphs cut out of the scanned image. They even contain multiple glyphs for the same recognized letter; in your example document, glyphs 9, 13, and 15 all represent 'H'.
The advantage of this approach is that such PDFs can be manipulated more easily; text chunks can even be edited.
Unfortunately for your case, though, the OCR software appears to know only Latin characters and Arabic numerals; in particular, it does not appear to know Cyrillic characters. Thus, it assigns each Cyrillic glyph to the most similar Latin character or Arabic numeral.
This, of course, makes text extraction useless. Furthermore, some viewers show the assigned Latin character using a standard font instead of the glyph from the ad-hoc font, in particular when selecting the text, and the text shown that way also makes no sense.
Thus, you should either scan again with OCR switched off, or export the PDFs as images and build new PDFs from those images alone.
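For the second option, here is a minimal sketch with PDFBox 2.0.x that renders each page to a bitmap and builds a new, image-only PDF; the 300 DPI value and the file names are placeholder assumptions to adjust:

import java.awt.image.BufferedImage;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.rendering.PDFRenderer;

PDDocument source = PDDocument.load(new File("D:/input.pdf"));
PDDocument target = new PDDocument();
PDFRenderer renderer = new PDFRenderer(source);
for (int i = 0; i < source.getNumberOfPages(); i++)
{
    // render the page as a plain bitmap; this flattens the ad-hoc font text
    // and the background image into a single picture
    BufferedImage image = renderer.renderImageWithDPI(i, 300);
    PDImageXObject pdImage = LosslessFactory.createFromImage(target, image);
    // keep the original page size so the image fills the page exactly
    PDRectangle pageSize = source.getPage(i).getMediaBox();
    PDPage page = new PDPage(pageSize);
    target.addPage(page);
    PDPageContentStream cs = new PDPageContentStream(target, page);
    cs.drawImage(pdImage, 0, 0, pageSize.getWidth(), pageSize.getHeight());
    cs.close();
}
target.save("D:/output.pdf");
target.close();
source.close();

LosslessFactory keeps full quality but can produce large files; JPEGFactory.createFromImage is an alternative if file size matters.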
Upvotes: 2