Reputation: 7731
I would like to add OCR capabilities to my Django app running on Heroku. I suspect the easiest way is to use Tesseract. I've noticed that there are a number of Python wrappers for Tesseract's API, but what is the best way to get Tesseract installed and running on Heroku? Via a custom buildpack like heroku-buildpack-tesseract, maybe?
Upvotes: 1
Views: 2405
Reputation: 661
1) Add the apt buildpack (heroku-community/apt) to your app
This is the stable version. See the source repository for details.
$ heroku buildpacks:add --index 1 heroku-community/apt
2) Add Aptfile to project directory
$ touch Aptfile
3) Add the following to the Aptfile
tesseract-ocr-eng is the English language file for Tesseract.
tesseract-ocr
tesseract-ocr-eng
4) Get path to the data downloaded by the tesseract-ocr-eng package
We will use this path for the next step
$ heroku run bash
$ find -iname tessdata # this will give us the path we need
You can exit the Heroku shell now
$ exit
5) Set a Heroku config variable named TESSDATA_PREFIX to the path returned by the find -iname tessdata command above
$ heroku config:set TESSDATA_PREFIX=./.apt/usr/share/tesseract-ocr/4.00/tessdata
6) Push changes to Heroku
$ git push heroku master
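Once the push has finished, a quick check from Python can confirm that the pieces from the steps above are in place. The following is only a minimal sketch using the standard library: the function names find_tessdata_dirs and tesseract_status are made up for illustration, and it assumes the apt buildpack puts the tesseract binary on the PATH.

```python
import os
import shutil


def find_tessdata_dirs(root="."):
    """Return directories named 'tessdata' (case-insensitive) under root.

    Rough Python equivalent of `find -iname tessdata`.
    """
    matches = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for dirname in dirnames:
            if dirname.lower() == "tessdata":
                matches.append(os.path.join(dirpath, dirname))
    return sorted(matches)


def tesseract_status(root="."):
    """Collect the facts that are useful when debugging the install."""
    prefix = os.environ.get("TESSDATA_PREFIX")
    return {
        "binary": shutil.which("tesseract"),  # None if not on PATH
        "tessdata_prefix": prefix,            # None if the config var is unset
        "prefix_exists": bool(prefix) and os.path.isdir(prefix),
        "tessdata_dirs": find_tessdata_dirs(root),
    }


if __name__ == "__main__":
    for key, value in tesseract_status().items():
        print(f"{key}: {value}")
```

Running this inside heroku run bash (for example with python on a file containing it) prints where the binary and the tessdata directory ended up. Note that a relative TESSDATA_PREFIX such as the one above is resolved against the current working directory.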
I hope this helps. Let me know if it works for you.
Upvotes: 0
Reputation: 7731
I'll try to capture some notes on the solution I arrived at here.
My .buildpacks file:
https://github.com/heroku/heroku-buildpack-python
https://github.com/clearideas/heroku-buildpack-ghostscript
https://github.com/marcolinux/heroku-buildpack-libraries
My .buildpacks_bin_download file:
tesseract-ocr https://s3.amazonaws.com/tesseract-ocr/heroku/tesseract-ocr-3.02.02.tar.gz 3.02 eng,spa
Here is the key piece of Python that does the OCR of PDF files:
# Additional processing
# Note: Path is assumed to be unipath.Path; ReadBot, ghostscript, name and
# document are defined elsewhere in the project.
document_path = Path(str(document.attachment_file))
if document_path.ext == '.pdf':
    working_path = Path('temp', document.directory)
    working_path.mkdir(parents=True)
    input_path = Path(working_path, name)
    input_path.write_file(document.attachment_file.read(), 'w')
    rb = ReadBot()
    # Render each PDF page to a 600 dpi grayscale PNG with Ghostscript
    args = [
        'VBEZ',
        # '-sDEVICE=tiffg4',
        '-sDEVICE=pnggray',
        '-dNOPAUSE',
        '-r600x600',
        '-sOutputFile=' + str(working_path) + '/page-%00d.png',
        str(input_path)
    ]
    ghostscript.Ghostscript(*args)
    # OCR every page image and concatenate the recognized text
    image_paths = working_path.listdir(pattern='*.png')
    txt = ''
    for image_path in image_paths:
        ocrtext = rb.interpret(str(image_path))
        txt = txt + ocrtext
    document.notes = txt
    document.save()
    working_path.rmtree()
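One thing to watch with the loop above: if the page numbers in the PNG file names are not zero-padded and the directory listing comes back in plain alphabetical order, page-10.png sorts before page-2.png and the OCR text gets concatenated out of order. Here is a small sketch of sorting by the numeric page index instead. The helper names page_number and in_page_order are hypothetical, and it assumes the page-N.png naming used above.

```python
import re


def page_number(path):
    """Extract the page number from a name like 'temp/doc/page-12.png'.

    Returns -1 for names that do not match, so they sort first.
    """
    match = re.search(r"page-(\d+)\.png$", str(path))
    return int(match.group(1)) if match else -1


def in_page_order(paths):
    """Return the paths sorted by numeric page index, not alphabetically."""
    return sorted(paths, key=page_number)
```

In the snippet above this would be used as: for image_path in in_page_order(image_paths).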
Upvotes: 1