Apache Tika does not get embedded images in PDF documents

Question

I just found a problem with PDF documents that have embedded images.

Doing:

java -jar tika-app-1.5.jar --extract tika.pdf

Tika cannot find the image.

Is this a PDF-related problem? If I run the same operation on a DOC document, Tika finds the image correctly.

Thank you in advance!

Gagravarr · Accepted Answer

You need to upgrade your version of Apache Tika. Support for extracting embedded images from PDFs was added through TIKA-1268 after 1.5 was released, which is why you're not getting them with Tika 1.5.

Apache Tika 1.6 is due out shortly, and once it is released you'll be able to extract images from PDFs with it.

In the meantime, you can either build Tika from source yourself or grab a nightly build. For production use, you'd be best off waiting a few days for 1.6. For testing, you ought to be OK with a nightly build or a build from trunk (provided the tests passed!).
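
As a rough sketch of what that looks like in practice, here is the same command from the question run against a newer jar. The jar name tika-app-1.6.jar is an assumption based on how the 1.5 jar is named, and the output directory name is purely illustrative. Depending on the version you end up with, inline image extraction may also need to be switched on in the PDF parser configuration, so treat this as a starting point rather than a guaranteed recipe.

```shell
# Optional: build the app jar yourself from a Tika source checkout.
# Tika builds with Maven; the jar should end up under tika-app/target/.
# (Run from the root of the checkout.)
mvn clean install

# Create a directory for the extracted files (name is illustrative).
mkdir -p extracted

# Re-run the extraction with the newer jar.
# tika-app-1.6.jar is an assumed file name; use the jar you built or downloaded.
java -jar tika-app-1.6.jar --extract --extract-dir=extracted tika.pdf

# Any embedded images that Tika found should now be listed here.
ls extracted
```

If the directory is still empty after this, check that the jar really is newer than 1.5 (for example via its file name or the build output) before looking for other causes.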
