Reputation: 7830
I've recently set up a Linux server to convert text-based PDFs to text using the pdftotext command that's part of Xpdf, as well as to convert image-based PDFs to text using a combination of the gs (Ghostscript) and tesseract commands.
Both solutions work pretty well when I already know whether a PDF is text-based or image-based. However, in order to automate the process of converting many PDFs to text, I need to be able to tell whether a PDF is text-based or image-based so that I know which set of processes to run on the PDF.
Is there any way in PHP to analyze a PDF and tell whether it's text-based or image-based so that I know whether to use Xpdf or Ghostscript/Tesseract on it?
Upvotes: 1
Views: 1010
Reputation: 1108
I think the answer from Kurt Pfeifle here is superb: use pdffonts (which is also part of Xpdf / Poppler) to list which fonts a PDF uses.
If it uses any font, it contains text. If not, it contains only images.
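A minimal sketch of that check, written in Python for brevity; the same logic should carry over to PHP by running the command with exec(). It assumes pdffonts is on the PATH and prints its usual two header lines (column names, then a row of dashes) followed by one line per font. The function names count_fonts and is_text_based are made up for this example. Note that a scanned PDF that already carries an OCR text layer would also be reported as text-based by this test.

```python
import subprocess


def count_fonts(pdffonts_output: str) -> int:
    """Count font rows in pdffonts output (assumed layout: header, dashes, rows)."""
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        # The row of dashes separates the column header from the font rows.
        if line.startswith("---"):
            return sum(1 for row in lines[i + 1:] if row.strip())
    # No separator line found: treat as "no fonts listed".
    return 0


def is_text_based(pdf_path: str) -> bool:
    """Run pdffonts on the file and report whether it lists at least one font."""
    result = subprocess.run(
        ["pdffonts", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise if pdffonts fails (missing file, damaged PDF, ...)
    )
    return count_fonts(result.stdout) > 0
```

With this in place, a batch script could call is_text_based() on each file and send it to pdftotext when it returns True, or to the Ghostscript/Tesseract pipeline otherwise.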
Upvotes: 1
Reputation: 2929
Comparing the text from an OCR run with the output of an Xpdf run and deciding whether they match is a non-trivial task. For a PDF whose text cannot be OCRed reliably (e.g. very small letters) but can be extracted by Xpdf, you will even end up with a lot of unnecessary gibberish.
I would suggest extracting the images from the PDFs and OCRing only those, not the complete PDF. This way
As you are already using Xpdf, you could use pdfimages -all to extract the images.
[1] This is not 100% correct, as the PDF might be a sandwiched PDF that already has an OCRed text layer "behind" the image.
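The image-only approach could look roughly like the following Python sketch. It assumes a pdfimages build that accepts -all (Poppler's does) and a tesseract binary on the PATH that accepts stdout as its output base. The helper names and the "img" file prefix are invented for this example. Some formats written by -all may not be readable by Tesseract, so failed OCR runs are simply skipped here.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_command(pdf_path, out_prefix):
    """Build the command that dumps every embedded image in its native format."""
    return ["pdfimages", "-all", str(pdf_path), str(out_prefix)]


def tesseract_command(image_path):
    """Build the command that OCRs one image and writes the text to stdout."""
    return ["tesseract", str(image_path), "stdout"]


def ocr_embedded_images(pdf_path):
    """Extract the images of a PDF and return the concatenated OCR text."""
    texts = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "img"  # hypothetical prefix for the extracted files
        subprocess.run(pdfimages_command(pdf_path, prefix), check=True)
        for image in sorted(Path(tmp).glob("img-*")):
            result = subprocess.run(
                tesseract_command(image), capture_output=True, text=True
            )
            # Skip images Tesseract could not read instead of aborting the batch.
            if result.returncode == 0:
                texts.append(result.stdout)
    return "\n".join(texts)
```

The text returned by ocr_embedded_images() could then be appended to whatever pdftotext extracts from the same file, keeping the sandwiched-PDF caveat above in mind.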
Upvotes: 0