HartleySan

Reputation: 7830

How to differentiate between "text" PDFs and "image" PDFs in PHP?

I've recently set up a Linux server to be able to convert text-based PDFs to text by using the pdftotext command that's part of Xpdf as well as to convert image-based PDFs to text by using a combination of the gs (Ghostscript) and tesseract commands.

Both solutions work pretty well when I already know whether a PDF is text-based or image-based. However, in order to automate the process of converting many PDFs to text, I need to be able to tell whether a PDF is text-based or image-based so that I know which set of processes to run on the PDF.

Is there any way in PHP to analyze a PDF and tell whether it's text-based or image-based so that I know whether to use Xpdf or Ghostscript/Tesseract on it?

Upvotes: 1

Views: 1010

Answers (2)

dankito

Reputation: 1108

I think the answer from Kurt Pfeifle here is superb: use pdffonts, which is also part of Xpdf / Poppler, to list the fonts a PDF uses.

If it lists any fonts, the PDF contains text. If it lists none, the PDF contains only images.
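
A minimal sketch of that check, assuming the usual pdffonts layout of a header row, a dashed separator row, and then one row per font. The function names `count_fonts` and `is_text_pdf` are made up for illustration, and the parsing is shown in Python for brevity; from PHP the same command could be run with shell_exec() and its output parsed the same way.

```python
import subprocess


def count_fonts(pdffonts_output: str) -> int:
    """Count the font rows that follow the dashed separator line.

    Assumes the output looks like:
        name        type      encoding  emb sub uni object ID
        ----------- --------- --------- --- --- --- ---------
        ABCDEF+Foo  TrueType  WinAnsi   yes yes no       8  0
    """
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("---"):
            # Every non-blank line after the separator is one font.
            return sum(1 for row in lines[i + 1:] if row.strip())
    return 0  # no separator found: treat as "no fonts listed"


def is_text_pdf(path: str) -> bool:
    """Run pdffonts on the file and report whether any font is listed."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return count_fonts(result.stdout) > 0
```

A PDF for which `is_text_pdf` returns True would go to pdftotext, and the rest to Ghostscript/Tesseract. Note that a scanned PDF that already carries an OCR text layer will also list fonts, so it would be classified as text.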

Upvotes: 1

tobltobs

Reputation: 2929

Comparing the OCR output with the Xpdf output and deciding whether the two texts match is a non-trivial task. If the PDF contains text that cannot be OCRed reliably (e.g. very small letters) but that Xpdf can extract, the OCR run will even leave you with a lot of unnecessary gibberish.

I would suggest extracting the images from the PDFs and OCRing only those, not the complete PDF. This way:

  • You don't have to compare texts [1].
  • Depending on how the images are included in the PDF, you might also get better OCR results.
  • Also you would avoid unnecessarily OCRing text which is contained as clear text.

As you are already using xpdf you could use pdfimages -all to extract images.

[1] This is not 100% correct, as the PDF might be a sandwiched PDF in which an OCRed text layer already sits "behind" the image.
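
The extraction step could be sketched roughly like this. It assumes a pdfimages build that accepts -all (the Poppler version does; it is worth checking with an Xpdf build), and the helper names and the "img" output prefix are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Build the argument list: pdfimages -all <pdf> <output-prefix>."""
    return ["pdfimages", "-all", str(pdf_path), str(prefix)]


def extract_images(pdf_path, out_dir=None):
    """Extract all embedded images and return the resulting file paths."""
    out_dir = Path(out_dir or tempfile.mkdtemp(prefix="pdfimg_"))
    out_dir.mkdir(parents=True, exist_ok=True)
    # pdfimages writes files named <prefix>-000.<ext>, <prefix>-001.<ext>, ...
    subprocess.run(pdfimages_command(pdf_path, out_dir / "img"), check=True)
    return sorted(p for p in out_dir.iterdir() if p.is_file())
```

Each returned file can then be passed to tesseract. With -all the images are written in their native formats, so some of them may need converting before Tesseract will read them.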

Upvotes: 0
