How to differentiate between "text" PDFs and "image" PDFs in PHP?

Question

I've recently set up a Linux server to convert PDFs to text: text-based PDFs with the pdftotext command that's part of Xpdf, and image-based PDFs with a combination of the gs (Ghostscript) and tesseract commands.

Both solutions work pretty well when I already know whether a PDF is text-based or image-based. However, in order to automate the process of converting many PDFs to text, I need to be able to tell whether a PDF is text-based or image-based so that I know which set of processes to run on the PDF.

Is there any way in PHP to analyze a PDF and tell whether it's text-based or image-based so that I know whether to use Xpdf or Ghostscript/Tesseract on it?

dankito · Accepted Answer

I think Kurt Pfeifle's answer is superb: use pdffonts, which is also part of Xpdf / Poppler, to list which fonts a PDF uses.

If it uses any font, it contains text. If not, it contains only images.
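
Since the question asks for PHP, here is a minimal sketch of that check. It assumes pdffonts is on the PATH and prints its usual layout: a column header row, a row of dashes, then one row per font. The function names countFontRows and pdfUsesFonts are made up for illustration. Note that a scanned PDF that already carries an OCR text layer will also list fonts, so treat the result as a heuristic.

```php
<?php
declare(strict_types=1);

/**
 * Count the font rows in pdffonts output, given as an array of lines.
 * Assumption: the output is a header row, a dashed separator row,
 * then one non-empty row per font.
 */
function countFontRows(array $lines): int
{
    $count = 0;
    $pastSeparator = false;

    foreach ($lines as $line) {
        if (!$pastSeparator) {
            // The separator row consists only of dashes and spaces.
            if (preg_match('/^-{3,}[-\s]*$/', $line) === 1) {
                $pastSeparator = true;
            }
            continue;
        }
        if (trim($line) !== '') {
            $count++;
        }
    }

    return $count;
}

/**
 * Run pdffonts on a file and report whether it lists at least one font.
 * Throws if pdffonts exits with a non-zero status (missing binary,
 * unreadable or damaged PDF, and so on).
 */
function pdfUsesFonts(string $pdfPath): bool
{
    $lines = [];
    $exitCode = 0;

    // escapeshellarg() keeps odd file names from breaking the command line.
    exec('pdffonts ' . escapeshellarg($pdfPath) . ' 2>/dev/null', $lines, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException("pdffonts failed for {$pdfPath} (exit code {$exitCode})");
    }

    return countFontRows($lines) > 0;
}
```

With this in place, a batch script could call pdfUsesFonts($path) for each file and send it to pdftotext when the result is true, or to the Ghostscript/Tesseract route when it is false. Parsing is kept separate from the exec() call so the row counting can be checked without pdffonts installed.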
