Reputation: 1683
I would like to batch OCR about 5800 PDFs (each between 2 and 6 pages, from my last question here) with open source command line tools on a Mac. The main purpose of this adventure is to retrieve names (surnames most importantly) as reliably as I can from the text of all these PDFs. Here is an example of what an issue looks like.
At this point, I do not know exactly how to proceed. What would you do?
I had in mind to first convert each multipage PDF into single-page images (png, jpg, or tif) and to move all images belonging to one PDF into a folder of its own with the following command:
time for i in *.pdf; do mkdir "${i%.pdf}"; convert -colorspace GRAY -resize 3000x -units PixelsPerInch "$i" "${i%.pdf}.jpg"; mv *.jpg "${i%.pdf}"; done
As a second step, my OCR script would need to enter each folder, do its magic, and leave again in order to proceed with the next one. I do not know how to write this. The core of the script would be:
tesseract --tessdata-dir /usr/local/share/tessdata/ --oem 3 --psm 11 -l deu_frak *.jpg test.txt
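Something along these lines is what I imagine for visiting the folders, assuming one folder per PDF as created above (untested):
for dir in */; do
  (
    cd "$dir" || exit    # subshell, so we fall back out automatically
    for img in *.jpg; do
      tesseract --tessdata-dir /usr/local/share/tessdata/ --oem 3 --psm 11 -l deu_frak "$img" "${img%.jpg}"
    done
  )
done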
As the PDFs represent old newspaper articles, published almost daily between 1810 and 1832, they are set in German Fraktur. This typeface seems to be particularly challenging for tesseract. My text output is normally scrambled; on the article linked above, for example, I get only between 791 and 801 diacritics detected for the first page. Names are at risk of not being identified as such, depending on the chosen options.
At the end, I would use ripgrep to look for names within all 5800 txt files that I hope to obtain.
time rg -i search_term_here
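Once I have a list of surnames, ripgrep can also read all the patterns from a file, e.g. a hypothetical surnames.txt with one name per line:
time rg -i -t txt -l -f surnames.txt
The -l flag prints only the names of the files that contain a match.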
Finally, how can I make sure that I get the best possible OCR output, so that I capture most of the (sur)names in the texts?
P.S.: By the way, when will tesseract 4 be available for the Mac, and with German Fraktur training data?
Edit:
These are the commands I have used to achieve what I wanted, although the output of tesseract could still be improved a great deal.
Convert each PDF into jpg images and move them into their respective folders to keep order:
time parallel -j 8 'mkdir {.} && convert {} -colorspace GRAY -resize 3000x -units PixelsPerInch {.}/{.}.jpg' ::: *.pdf
Using Fred's ImageMagick script textcleaner (which I have moved to /usr/local/bin/ for convenience) to improve the tesseract output a bit:
time find . -name \*.jpg | parallel textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10 {} {}
Parallelising the tesseract analyses:
time find . -name \*.jpg | parallel -j 8 "tesseract {} {.} --tessdata-dir /usr/local/share/tessdata/ -l deu_frak"
Search for the surnames with ripgrep:
time rg -t txt -i term
Upvotes: 1
Views: 619
Reputation: 208013
First, I would recommend you install homebrew if you have not already - it is an excellent package manager for the Mac.
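If you have not got it, the one-line installer from brew.sh (URL as given on that page) is:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"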
Then I would recommend you install the Poppler package to get the pdfimages tool:
brew install poppler
You can then extract images from a PDF like this:
pdfimages SomeFile.pdf root
and you will get files named root-000.ppm and root-001.ppm which will work fine with tesseract. Or you can add -png if you want PNG images. I would avoid JPEG because of its lossy compression.
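For example:
pdfimages -png SomeFile.pdf root
which gives you root-000.png and so on instead.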
If you can get that working, I would then suggest you install GNU Parallel with:
brew install parallel
and we can work on doing OCR in parallel down the line.
We can also extract the images in parallel using GNU Parallel like this:
parallel 'mkdir {.} && pdfimages {} {.}/{.}' ::: *pdf
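When we get to the OCR itself, the same pattern applies - assuming you have the deu_frak traineddata installed, something like this would OCR all the extracted pages in parallel:
parallel 'tesseract {} {.} -l deu_frak' ::: */*.ppm
Each root-NNN.ppm then gets a matching root-NNN.txt beside it.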
As regards using Fred's textcleaner with GNU Parallel, and wanting to overwrite the JPEGs in place, I think you will want something like this:
find . -name \*.jpg | parallel textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10 {} {}
Upvotes: 3