Reputation: 30424
A PDF file extension can be verified by the magic signature: 25 50 44 46
However, I want to detect whether a PDF contains text or image (i.e. whether the PDF contains text that can be searched with ctrl+f OR whether it contains scanned documents)
Is there a way to do this?
Upvotes: 0
Views: 190
Reputation: 20069
Well technically, you could parse the PDF document structure and look for elements that contain text. I imagine this would require a big effort to implement.
So you may want to use a premade PDF package to do the parsing for you (PDFBox, BfoPDF or something similar). Still, I think it will require some effort to implement.
The simplest way that I know of would be to use a package that can extract the plain text for you. Apache TIKA can do this. Just feed it the document and see if you get something back.
In any case it will be hard to classify PDF's that contain both images and text.
Upvotes: 1