Reputation: 444
Some of the PDF documents in our system were created by scanning, with OCR text included. However, the OCR was not performed correctly (it mixed up Cyrillic and Latin characters), and although a document looks searchable, that text is completely incorrect and unusable.
When viewing such a PDF document in Adobe Acrobat Reader DC (or Google Chrome), it is displayed correctly, but on a web page that uses PDF.js to render the document, the OCR text shows in front, instead of the scanned graphical representation of the original text.
The idea is to "repair" these documents by removing the OCR text from the PDF document while preserving the scanned graphical representation of the original text.
For that purpose I have used Apache PDFBox 2.0.11 to inspect the contents of the PDF document. The following code snippet prints the entire text contained in the PDF document, and in this case that text is exactly the same as the OCR text:
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// extract and print all text contained in the PDF
PDDocument document = PDDocument.load(new File("D:/input.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(document.getNumberOfPages());
String sText = stripper.getText(document);
System.out.println(sText);
document.close();
Then I used the example class RemoveAllText provided with PDFBox, hoping it would remove the OCR text from the PDF document. Unfortunately, it removed not only the OCR text, but also the graphical representation of the original scanned text. The method that looks for text elements in the PDF document and removes them is shown below:
private static List<Object> createTokensWithoutText(PDContentStream contentStream) throws IOException
{
    PDFStreamParser parser = new PDFStreamParser(contentStream);
    Object token = parser.parseNextToken();
    List<Object> newTokens = new ArrayList<Object>();
    while (token != null)
    {
        if (token instanceof Operator)
        {
            Operator op = (Operator) token;
            if ("TJ".equals(op.getName()) || "Tj".equals(op.getName()) ||
                    "'".equals(op.getName()) || "\"".equals(op.getName()))
            {
                // remove the one argument to this operator
                newTokens.remove(newTokens.size() - 1);
                token = parser.parseNextToken();
                continue;
            }
        }
        newTokens.add(token);
        token = parser.parseNextToken();
    }
    return newTokens;
}
I presume that this method should be changed somehow (to remove just the text and not its graphical representation), but I don't know how to do that.
Here is an example of a PDF document before RemoveAllText, and here is an example of a PDF document after RemoveAllText.
Upvotes: 0
Views: 2199
Reputation: 95938
There is indeed an error in the createTokensWithoutText code you copied from the PDFBox examples. But the reason that example removes all text from your scanned PDF is that the scanner software already removed the letters from the image, created ad-hoc fonts for them, and drew them again as text using these fonts, so the example simply does what it is meant to do.
The error in createTokensWithoutText
While the text-showing operators Tj, ', and TJ indeed have only a single parameter each, " has three:
aw ac string " – Move to the next line and show a text string, using aw as the word spacing and ac as the character spacing (setting the corresponding parameters in the text state). aw and ac shall be numbers expressed in unscaled text space units.
(ISO 32000-1 Table 109 – Text-showing operators)
If there is a " operation in the stream, therefore, createTokensWithoutText only removes the string argument and the operator but leaves the numeric parameters aw and ac in place. This in turn results in an invalid set of arguments for the following instruction in newTokens.
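Based on that analysis, here is a minimal sketch of a corrected loop body, keeping the variable names from the example above and removing as many preceding tokens as the operator has operands:

if (token instanceof Operator)
{
    String opName = ((Operator) token).getName();
    if ("TJ".equals(opName) || "Tj".equals(opName) || "'".equals(opName))
    {
        // these text-showing operators take exactly one operand, the string
        newTokens.remove(newTokens.size() - 1);
        token = parser.parseNextToken();
        continue;
    }
    if ("\"".equals(opName))
    {
        // " takes three operands: aw, ac, and the string
        newTokens.remove(newTokens.size() - 1);
        newTokens.remove(newTokens.size() - 1);
        newTokens.remove(newTokens.size() - 1);
        token = parser.parseNextToken();
        continue;
    }
}
newTokens.add(token);
token = parser.parseNextToken();

Note, though, that even with this fix all the visible text in your documents will vanish, for the reason explained next.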
The OCR software here did not simply add invisible characters in front of or behind the glyphs in the image to provide text extraction capabilities (which is a very common approach). Instead, it actually created ad-hoc fonts from the glyphs in the image, removed the glyphs from the image, and drew them visibly in front of the image.
Thus, the remaining image only contains some dirt the software did not associate with any glyph.
The ad-hoc fonts contain glyphs cut out of the scanned image. They even contain multiple glyphs for the same recognized letter; in your example document, glyphs 9, 13, and 15 all represent 'H'.
The advantage of this approach is that such PDFs can be manipulated more easily; text chunks can even be edited.
Unfortunately for your case, though, the OCR software appears to know only Latin characters and Arabic numerals; in particular, it does not appear to know Cyrillic characters. Thus, it assigns each Cyrillic glyph to the most similar Latin character or Arabic numeral.
This, of course, makes text extraction useless. Furthermore, some viewers show the assigned Latin character using a standard font instead of the glyph from the ad-hoc font, in particular when selecting the text, and the text shown that way also makes no sense.
Thus, you should either scan again with OCR switched off, or export the PDFs as images and build new PDFs from those images alone.
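For the second option, here is a minimal sketch with PDFBox 2.0.x that renders each page to a bitmap and builds a new, image-only PDF; the 300 DPI value and the file names are placeholder assumptions to adjust:

import java.awt.image.BufferedImage;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.rendering.PDFRenderer;

PDDocument source = PDDocument.load(new File("D:/input.pdf"));
PDDocument target = new PDDocument();
PDFRenderer renderer = new PDFRenderer(source);
for (int i = 0; i < source.getNumberOfPages(); i++)
{
    // render the page as a plain bitmap; this flattens the ad-hoc font text
    // and the background image into a single picture
    BufferedImage image = renderer.renderImageWithDPI(i, 300);
    PDImageXObject pdImage = LosslessFactory.createFromImage(target, image);
    // keep the original page size so the image fills the page exactly
    PDRectangle pageSize = source.getPage(i).getMediaBox();
    PDPage page = new PDPage(pageSize);
    target.addPage(page);
    PDPageContentStream cs = new PDPageContentStream(target, page);
    cs.drawImage(pdImage, 0, 0, pageSize.getWidth(), pageSize.getHeight());
    cs.close();
}
target.save("D:/output.pdf");
target.close();
source.close();

LosslessFactory keeps full quality but can produce large files; JPEGFactory.createFromImage is an alternative if file size matters.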
Upvotes: 2