Reputation: 110
I need to integrate tesseract-ocr to convert scanned images stored as PDFs to text.
A TesseractOCRParser is already available in Tika,
but I cannot find any method for invoking it.
When I try to build Tika with the tesseract-ocr path configured, I get the following error:
Results:
Failed tests:
testNoConfig(org.apache.tika.parser.ocr.TesseractOCRConfigTest):
Invalid default tesseractPath value expected:<[]> but was:
<[/home/serendio/tesseract-ocr/]>
Tests run: 569, Failures: 1, Errors: 0, Skipped: 7
Can anyone help me out?
Or is there any other way to resolve this problem?
Upvotes: 1
Views: 15149
Reputation: 1114
I think this guide can help: https://wiki.apache.org/tika/TikaOCR
I followed it and was able to extract the content easily. I simply installed Tesseract and then Tika.
Using Tika 1.9, I was able to:
- extract the content by calling a local Tika server directly
- extract the content in a custom application (you can use the tika-example project) with no effort
No modification was needed. Everything worked out of the box.
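For the "local Tika server" route, here is a minimal sketch of what the call could look like from Java 11+ using only the standard library. It assumes a tika-server instance is already running at http://localhost:9998 (its usual default) with Tesseract installed on the same machine. The class name TikaServerOcrClient and its methods are hypothetical names chosen for illustration, not part of Tika. Whether images embedded in a scanned PDF actually get OCRed depends on how the server is configured, which this sketch does not cover.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class for illustration; not part of Apache Tika.
public class TikaServerOcrClient {

    // Assumption: tika-server was started locally on its usual default port.
    static final URI DEFAULT_ENDPOINT = URI.create("http://localhost:9998/tika");

    // Builds a PUT request that uploads the document bytes and asks for plain text back.
    static HttpRequest buildRequest(URI endpoint, byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the file to the server and returns the extracted text.
    static String extractText(URI endpoint, Path file) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = buildRequest(endpoint, Files.readAllBytes(file));
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Tika server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java TikaServerOcrClient <file>");
            return;
        }
        System.out.println(extractText(DEFAULT_ENDPOINT, Path.of(args[0])));
    }
}
```

With this approach the Tika source tree is never rebuilt, so the TesseractOCRConfigTest default-path check shown in the question does not come into play.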
Upvotes: 2