How to detect OCR in a scanned Document with pdfbox 2.0.0?

Question

The problem: I have a large folder with many subfolders, each containing many PDFs. Some of them have already been OCRed and some haven't. So I wanted to write a Java program that filters out the non-OCR PDFs and copies them to a hot folder.

I tested about 20 documents, and what they all have in common is that if you open them with an editor, you can find the word 'font' in the OCR ones but not in the non-OCR ones. My question now is: how do I implement this check using PDFBox 2.0.0? All the solutions I found seem to work only with older versions, and I haven't been able to find a solution in the documentation (which is clearly my fault).

Thanks in advance.

Tilman Hausherr · Accepted Answer

Here's how to find out if fonts are on the top level of a page:

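    import java.io.File;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDResources;

    PDDocument doc = PDDocument.load(new File(...));
    PDPage page = doc.getPage(0); // 0 based
    PDResources resources = page.getResources();
    // prints the name of each font declared in the page's own resources
    for (COSName fontName : resources.getFontNames())
    {
        System.out.println(fontName.getName());
    }
    doc.close();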

Re: mkl's suggestion, here's how to extract text:

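    import org.apache.pdfbox.text.PDFTextStripper; // 2.0 package; 1.8 used org.apache.pdfbox.util

    // doc is an open PDDocument, so run this before doc.close()
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(1); // 1 based
    stripper.setEndPage(1);
    String extractedText = stripper.getText(doc);
    System.out.println(extractedText);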
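
To connect this to the folder-filtering goal in the question, here is a minimal sketch of the surrounding logic, using only the JDK. It assumes that "no extractable text" is a good enough sign of a non-OCR scan, which may not hold for every document. The class name `NonOcrFilter`, the methods `hasTextLayer` and `copyNonOcrPdfs`, and the `hasText` predicate are hypothetical names for illustration, not part of PDFBox; the predicate is where a PDFBox-based check would be plugged in.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NonOcrFilter {

    /** Assumption: any non-whitespace character in the extracted text counts as a text layer. */
    public static boolean hasTextLayer(String extractedText) {
        return extractedText != null && !extractedText.trim().isEmpty();
    }

    /**
     * Walks root recursively and copies every *.pdf for which hasText
     * returns false into hotFolder. Returns the source files that were copied.
     */
    public static List<Path> copyNonOcrPdfs(Path root, Path hotFolder, Predicate<Path> hasText)
            throws IOException {
        Files.createDirectories(hotFolder);

        // Collect the candidates first so that copying does not interfere with the walk.
        List<Path> candidates;
        try (Stream<Path> paths = Files.walk(root)) {
            candidates = paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .filter(p -> !p.startsWith(hotFolder)) // skip the hot folder if it is inside root
                    .sorted()
                    .collect(Collectors.toList());
        }

        List<Path> copied = new ArrayList<>();
        for (Path pdf : candidates) {
            if (!hasText.test(pdf)) {
                // Note: files with the same name from different subfolders overwrite each other.
                Files.copy(pdf, hotFolder.resolve(pdf.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                copied.add(pdf);
            }
        }
        return copied;
    }
}
```

In practice the predicate would load each file with `PDDocument.load`, run the `PDFTextStripper` code shown above, pass the result to `hasTextLayer`, and close the document. Keeping the PDF check behind a predicate also makes it easy to swap in the font-name check instead.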
