Reputation: 2351
I tried using tesseract-ocr on this image: http://ablazinradio.com/site/wp-content/uploads/2015/06/lebron-james-cavs.jpg but it doesn't return the text "Cavs" or "23"; it returns nothing. Are there any other npm modules that could extract the text from that image, or can I do it manually somehow? Thanks.
Upvotes: 0
Views: 5771
Reputation: 126
So, textract is the package that will help in a Node.js project, and tika in Python. The issue with textract is that it requires you to install OS-level tools: pdftotext (for PDF), antiword (for Word docs), unrtf (for RTF), tesseract (for images), and drawingtotext (for DXF files). This works on a traditional server where you know the OS. In cloud functions or Lambda functions, where you do not know the OS, installing them may not be possible, and even where it is possible it still costs performance.
https://www.npmjs.com/package/textract
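To make the dependency issue concrete, here is a minimal sketch of checking whether the external tool for a given file is installed before attempting extraction. It does not call textract; it only looks for a binary on the PATH. The names `TOOL_FOR_EXTENSION`, `requiredTool` and `isOnPath` are hypothetical, and the extension-to-tool mapping is my assumption based on the list above.

```javascript
const fs = require("fs");
const path = require("path");

// Assumed mapping from file extension to the external tool named above.
// The exact extensions are an illustration, not taken from textract itself.
const TOOL_FOR_EXTENSION = {
  ".pdf": "pdftotext",
  ".doc": "antiword",
  ".rtf": "unrtf",
  ".dxf": "drawingtotext",
  ".png": "tesseract",
  ".jpg": "tesseract",
  ".jpeg": "tesseract",
};

// Returns the name of the external tool a file would need, or null if unknown.
function requiredTool(fileName) {
  const ext = path.extname(fileName).toLowerCase();
  return TOOL_FOR_EXTENSION[ext] || null;
}

// Returns true if a file with the tool's name exists in any PATH directory.
// This checks existence only, not whether the file is executable.
function isOnPath(tool, envPath = process.env.PATH || "") {
  const names = process.platform === "win32" ? [tool, tool + ".exe"] : [tool];
  return envPath
    .split(path.delimiter)
    .filter(Boolean)
    .some((dir) => names.some((name) => fs.existsSync(path.join(dir, name))));
}

const tool = requiredTool("lebron-james-cavs.jpg");
if (tool && !isOnPath(tool)) {
  console.log(`${tool} is not installed, so extraction would likely fail`);
}
```

On a server you control this check should pass after a one-time install; in a cloud function it may fail on every cold start unless the binary is bundled with the deployment.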
Upvotes: 0
Reputation: 764
I just ran this through tesseract, and I got absolute gibberish back.
Tesseract really isn't equipped to process that kind of image, especially without any pre-processing of the image.
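As an illustration of what pre-processing can mean, here is a rough sketch of one common step: converting the image to pure black and white with a fixed threshold. It assumes you already have the pixels as a flat RGBA array (for example from a canvas or an image-decoding library); `binarize` and the default threshold of 128 are hypothetical choices, and I am not claiming this alone would make this particular photo readable.

```javascript
// Convert RGBA pixels to black and white using a fixed luminance threshold.
// `rgba` is a flat array of [r, g, b, a, r, g, b, a, ...] values in 0..255.
function binarize(rgba, threshold = 128) {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    // Weighted grayscale value (ITU-R BT.601 luma weights).
    const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    const value = luma >= threshold ? 255 : 0;
    out[i] = value;
    out[i + 1] = value;
    out[i + 2] = value;
    out[i + 3] = rgba[i + 3]; // keep the original alpha
  }
  return out;
}

// Two pixels: one white, one near-black.
const pixels = new Uint8ClampedArray([255, 255, 255, 255, 10, 10, 10, 255]);
console.log(Array.from(binarize(pixels)));
```

A fixed threshold is the simplest option; for photos with uneven lighting, an adaptive threshold usually does better.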
I don't think you'll find anything open source that can deal with that image.
Maybe give the Google Vision APIs a go https://cloud.google.com/vision/docs/
Otherwise, if you are willing to invest more time in tesseract, I suggest looking at the tesseract wiki to try to improve your results: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
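If you do experiment with tesseract from Node, one thing to try is the page segmentation mode, since a photo with a jersey number is not laid out like a page of text. Below is a minimal sketch that calls the `tesseract` command-line tool through `child_process`. The helper names `buildTesseractArgs` and `runTesseract` are hypothetical, the choice of mode 11 (sparse text) is only a guess at a starting point, and it assumes a tesseract version that accepts `--psm` (older 3.x releases use `-psm`).

```javascript
const { execFile } = require("child_process");

// Build the argument list for: tesseract <image> stdout -l <lang> --psm <n>
// "stdout" as the output base makes tesseract print the text instead of writing a file.
function buildTesseractArgs(imagePath, { lang = "eng", psm = 11 } = {}) {
  return [imagePath, "stdout", "-l", lang, "--psm", String(psm)];
}

// Run tesseract and pass the recognized text (trimmed) to the callback.
// Requires the tesseract binary to be installed and on the PATH.
function runTesseract(imagePath, options, callback) {
  execFile("tesseract", buildTesseractArgs(imagePath, options), (error, stdout) => {
    if (error) return callback(error);
    callback(null, stdout.trim());
  });
}

console.log(buildTesseractArgs("lebron-james-cavs.jpg").join(" "));
```

Usage would look like `runTesseract("lebron-james-cavs.jpg", { psm: 6 }, (err, text) => console.log(err || text))`, trying a few modes and comparing the output.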
Upvotes: 2