Kovalan R

Reputation: 110

How to integrate tesseract-ocr with Tika?

I need to integrate tesseract-ocr with Tika so that scanned images stored as PDFs can be converted to text.

Tika already provides a TesseractOCRParser, but I cannot find any documented way to invoke it.

When I try to build Tika with the tesseract-ocr path set, I get the following error:

Results:

Failed tests:   testNoConfig(org.apache.tika.parser.ocr.TesseractOCRConfigTest): Invalid default tesseractPath value expected:<[]> but was:<[/home/serendio/tesseract-ocr/]>

Tests run: 569, Failures: 1, Errors: 0, Skipped: 7

Can anyone help me out, or suggest another way to resolve this problem?

Upvotes: 1

Views: 15149

Answers (1)

Alessandro Benedetti

Reputation: 1114

I think this can help: https://wiki.apache.org/tika/TikaOCR. I followed this guide and was able to extract the content easily. I simply installed Tesseract and then Tika.

Using Tika 1.9, I was able to:
- extract the content by calling a local Tika server directly
- extract the content in a custom application (you can use the tika-example project) with no effort

No modification was needed. Everything worked out of the box.
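As a rough illustration of the "local Tika server" route, here is a minimal sketch of a Java client. It assumes a tika-server instance is already running on its default port (9998) on a machine where Tesseract is installed and on the PATH, and that you are on Java 11 or later. The class name `TikaServerOcrClient` and the sample file name are made up for this example; only the `PUT /tika` endpoint belongs to the Tika server.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper class, not part of Tika.
class TikaServerOcrClient {

    // Builds a PUT request that uploads the file to the server's /tika
    // endpoint and asks for the extracted content as plain text.
    static HttpRequest buildRequest(String serverBase, Path file) throws IOException {
        return HttpRequest.newBuilder()
                .uri(URI.create(serverBase + "/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    // Sends the request and returns the response body (the extracted text).
    static String extractText(String serverBase, Path file)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(serverBase, file),
                            HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Tika server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: TikaServerOcrClient <file> [serverBase]");
            return;
        }
        // Assumed default address of a locally started tika-server.
        String serverBase = args.length > 1 ? args[1] : "http://localhost:9998";
        System.out.println(extractText(serverBase, Path.of(args[0])));
    }
}
```

Run it with something like `java TikaServerOcrClient.java scanned.png`. Because the OCR step runs inside the server, nothing in the Tika source tree has to be edited, which also sidesteps the `TesseractOCRConfigTest` failure shown in the question: that test appears to expect the default `tesseractPath` to be empty.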

Upvotes: 2
