Adriano_jvma

Reputation: 475

Determine whether a PDF page contains text or is purely picture

How can I determine, using Java, whether a PDF page contains text or is purely a picture?

I have searched through many forums and websites, but I have not found an answer yet.

Is it possible to extract the text from a PDF page in order to tell whether the page is a picture or text?

    PdfReader reader = new PdfReader(INPUTFILE);
    PrintWriter out = new PrintWriter(new FileOutputStream(OUTPUTFILE));
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // here I want to test the structure of the page, if that is possible
        out.println(PdfTextExtractor.getTextFromPage(reader, i));
    }

Upvotes: 10

Views: 3984

Answers (2)

Alexander Stepchkov

Reputation: 755

With PDFBox 2.x you can try this:

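    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    private boolean hasText(PDDocument doc) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        return !stripper.getText(doc).trim().isEmpty();
    }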

Unfortunately, this processes the whole file and does not stop at the first text block. On the other hand, you get the complete text if you need it.

Upvotes: 0

Bruno Lowagie

Reputation: 77528

There is no watertight way to do what you want.

Text can appear in different ways inside a PDF file. For instance: one can draw all the glyphs using graphics state operators instead of using text state. (I'm sorry if this sounds like Chinese to you, but I can assure you it's proper PDF language.)
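To make the difference between text operators and graphics operators a little more concrete, here is a rough sketch that scans a page content stream for the text-showing operators (`Tj`, `TJ`, `'`, `"`) and for the operators that paint an XObject (`Do`) or begin an inline image (`BI`). The class `ContentStreamSniffer` and its method names are made up for illustration and are not part of iText or PDFBox. The sketch assumes the content stream has already been decoded into a `String`, and its tokenizer is deliberately naive: it ignores comments and the binary data of inline images, so it can be fooled.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of iText or PDFBox): a naive scan of an
 * already decoded page content stream for text and image operators.
 */
final class ContentStreamSniffer {

    /** True if the stream uses a text-showing operator: Tj, TJ, ' or ". */
    static boolean showsText(String content) {
        for (String token : tokens(content)) {
            if (token.equals("Tj") || token.equals("TJ")
                    || token.equals("'") || token.equals("\"")) {
                return true;
            }
        }
        return false;
    }

    /** True if the stream paints an XObject (Do) or begins an inline image (BI). */
    static boolean paintsXObjectOrInlineImage(String content) {
        for (String token : tokens(content)) {
            if (token.equals("Do") || token.equals("BI")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Splits the stream into tokens, dropping literal strings such as (Hello)
     * so that their contents are never mistaken for operators.
     */
    static List<String> tokens(String content) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // nesting depth inside a literal string
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (depth > 0) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
            } else if (c == '(') {
                flush(current, result);
                depth = 1;
            } else if (Character.isWhitespace(c) || c == '[' || c == ']') {
                flush(current, result);
            } else {
                current.append(c);
            }
        }
        flush(current, result);
        return result;
    }

    /** Moves the pending token, if any, into the result list. */
    private static void flush(StringBuilder current, List<String> result) {
        if (current.length() > 0) {
            result.add(current.toString());
            current.setLength(0);
        }
    }
}
```

A page whose glyphs are drawn as vector paths would report no text here even though a human reads text on it, and a `Do` may paint a form XObject rather than an image. That is exactly why a check like this can only ever be a heuristic.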

If an ad hoc solution that covers the most common situations and misses an exotic PDF once in a while is OK for you, then you already have a good first workaround.

In your code, you loop over all the pages, and you ask iText if there's any text on the page. That's already a good indication.

Internally, your code is using the RenderListener interface. iText parses the content of a page and triggers methods in a specific RenderListener implementation. An example of a custom implementation is MyTextRenderListener, which is used in the ParsingHelloWorld example.

There's also a renderImage() method (see for instance MyImageListener). If this method is triggered, you're 100% sure that there's also an image on the page, and you can use the ImageRenderInfo object to obtain the position, the width and the height of the image (that is: if you know how to interpret the Matrix returned by the getImageCTM() method).
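As a hint for interpreting that matrix: in PDF an image occupies the unit square in its own coordinate space, and the matrix `[a b 0; c d 0; e f 1]` maps that square onto the page. The sketch below derives a position and size from the six meaningful entries. `ImagePlacement` is a hypothetical helper, not an iText class. With iText 5 the entries would, as far as I know, be read via `Matrix.get(Matrix.I11)` for `a`, `I12` for `b`, `I21` for `c`, `I22` for `d`, `I31` for `e` and `I32` for `f`; please verify that mapping against the version you use.

```java
/**
 * Hypothetical helper (not an iText class): position and size of an image,
 * derived from the matrix that maps the image's unit square onto the page.
 */
final class ImagePlacement {
    final double x;      // page x of the image-space origin (0,0)
    final double y;      // page y of the image-space origin (0,0)
    final double width;  // length of the transformed unit vector (1,0)
    final double height; // length of the transformed unit vector (0,1)

    private ImagePlacement(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Builds a placement from the entries of the matrix [a b 0; c d 0; e f 1]. */
    static ImagePlacement fromCtm(double a, double b, double c, double d,
                                  double e, double f) {
        // (1,0) is mapped to (a,b) + (e,f), so its length is the rendered width.
        double width = Math.hypot(a, b);
        // (0,1) is mapped to (c,d) + (e,f), so its length is the rendered height.
        double height = Math.hypot(c, d);
        // (0,0) is mapped to (e,f): the corner where the image is anchored.
        return new ImagePlacement(e, f, width, height);
    }
}
```

For an unrotated, unskewed image this reduces to width `a`, height `d` and lower-left corner `(e, f)`. If the image is rotated, `(e, f)` is still the transformed origin corner, but it is no longer the lower-left corner of the bounding box. Comparing the resulting size with the page size gives a hint whether the image covers the whole page, as a scan typically does.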

Using all these elements, you can already get a long way to achieving what you need, but be aware that there will always be exotic PDFs that will escape all your checks.
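Putting the two signals together could look like the sketch below. It assumes you already have, for each page, the string returned by `PdfTextExtractor.getTextFromPage(reader, i)` and a flag that you set yourself when `renderImage()` was triggered for that page. `PageClassifier` and `PageKind` are hypothetical names introduced only for this illustration.

```java
/**
 * Hypothetical helper: combines the two per-page signals discussed above
 * (extracted text and "an image was rendered") into a rough classification.
 */
final class PageClassifier {

    enum PageKind { TEXT_ONLY, IMAGE_ONLY, TEXT_AND_IMAGE, EMPTY_OR_UNKNOWN }

    /**
     * @param extractedText text extracted for the page, may be null or blank
     * @param imageSeen     true if an image callback fired for the page
     */
    static PageKind classify(String extractedText, boolean imageSeen) {
        // Whitespace alone does not count as text.
        boolean hasText = extractedText != null && !extractedText.trim().isEmpty();
        if (hasText && imageSeen) {
            return PageKind.TEXT_AND_IMAGE;
        }
        if (hasText) {
            return PageKind.TEXT_ONLY;
        }
        if (imageSeen) {
            return PageKind.IMAGE_ONLY;
        }
        // Neither signal fired: a blank page, or text drawn as vector paths.
        return PageKind.EMPTY_OR_UNKNOWN;
    }
}
```

Note that a scanned page carrying an OCR text layer would come out as `TEXT_AND_IMAGE`, and a page whose glyphs are drawn as paths would come out as `EMPTY_OR_UNKNOWN`, so treat the result as an indication rather than a guarantee.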

Upvotes: 8
