littleK

Reputation: 20123

Tesseract Batch Convert Images to Searchable PDF And Multiple Corresponding Text Files

I’m using tesseract to batch convert a list of images to both a searchable PDF and a TXT file containing the OCR’d text.

tesseract infile outfile -l eng myconfig

This leaves me with outfile.pdf and outfile.txt, the latter of which contains page separators for delimiting text between images.

What I’m really looking to do, however, is to output multiple TXT files on a per-image basis, using the same corresponding image name. For example, Image1.jpg.txt, Image2.jpg.txt, Image3.jpg.txt...

Does tesseract have the option to support this behavior natively? I realize that I can loop through the image file list and execute tesseract on a per-image basis, but this is not ideal as I’d also have to run tesseract a second time to generate the merged PDF. Instead, I’d like to run both options at the same time, with less overall execution time.

I also realize that I can split the merged TXT file on the page separator into multiple text files, but then I have to introduce less elegant code to map and rename all of those split files to correspond to their original image names: Rename 0001.txt to Image1.jpg.txt...

I’m working with both Python 3 and Linux commands at my disposal.

Upvotes: 4

Views: 8401

Answers (4)

Alban Kaperi

Reputation: 625

Converting multiple images to a single PDF file.

On Linux, you can list all the images and pipe that list to tesseract:

ls *.jpg | tesseract - yourFileName txt pdf

Where:

yourFileName: is the name of the output file.

txt pdf: are the output formats, you can also use only one of them.
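
The same file list can be prepared from Python, which also gives a known page order for later use. This is only a sketch: the function names `write_image_list` and `run_batch_ocr` are made up for illustration, and it assumes tesseract is on the PATH and is given a text file with one image path per line as its input.

```python
import subprocess
from pathlib import Path


def write_image_list(image_dir, list_path, pattern="*.jpg"):
    """Write the sorted image paths, one per line, and return them."""
    images = sorted(str(p) for p in Path(image_dir).glob(pattern))
    Path(list_path).write_text("\n".join(images) + "\n", encoding="utf-8")
    return images


def run_batch_ocr(list_path, output_base):
    """Run one Tesseract pass that writes output_base.txt and output_base.pdf."""
    subprocess.run(
        ["tesseract", str(list_path), str(output_base), "txt", "pdf"],
        check=True,  # raise if tesseract exits with an error
    )
```

Keeping the list returned by `write_image_list` is useful because the pages in the merged output should follow the order of that list.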

Converting images to individual text files

On Linux, you can use the for loop to go through files and execute an action for every file.

for FILE in *.jpg; do tesseract "$FILE" "${FILE::-4}"; done

Where:

for FILE in *.jpg : loop through all JPG files (you can change the extension based on your format)

$FILE: is the name of the image file, e.g. 001.jpg

${FILE::-4}: is the name of the image but without the extension, e.g. 001.jpg will be 001 because we removed the last 4 characters.

We need this to name the text files to the corresponding names, e.g.

  • 001.jpg will be converted to 001.txt
  • 002.jpg will be converted to 002.txt

Upvotes: 5

Juan Jullian

Reputation: 1

Thank you!

BTW, I'm using 4.1.1.

I also discovered another traineddata file for the Spanish language that does a better job than the standard one. It actually recognizes the "o" character well. The only problem is the processing time, but I let the PC work overnight.

Honestly, I don't know how the new traineddata file does the job better. I downloaded it from: https://github.com/tesseract-ocr/tessdata_best

Upvotes: 0

nguyenq

Reputation: 8345

You can prepare a batch file that loops through the input images and outputs to both txt and pdf at the same time. This is more efficient: a single OCR operation instead of two. You can then split the output .txt file into pages.

tesseract inimagefile outfile txt pdf
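
In Python, such a loop might be sketched as below. The helper names `ocr_command` and `ocr_each` are hypothetical, and the sketch assumes tesseract is on the PATH. Note that it writes one PDF per image rather than a single merged PDF.

```python
import subprocess


def ocr_command(image_path):
    """Build the Tesseract call that writes <image>.txt and <image>.pdf."""
    # Using the image name itself as the output base yields names such as
    # Image1.jpg.txt and Image1.jpg.pdf.
    return ["tesseract", image_path, image_path, "txt", "pdf"]


def ocr_each(image_paths):
    """Run one OCR pass per image, producing both output formats."""
    for path in image_paths:
        subprocess.run(ocr_command(path), check=True)
```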

Upvotes: 1

littleK

Reputation: 20123

Since Tesseract doesn't seem to handle this natively, I've just developed a function to split the merged TXT file on the page separator into multiple text files. From my observations, though, I'm not sure that Tesseract runs any faster when converting batch images to both PDF and TXT simultaneously (versus running it twice: once for PDF, and once for TXT).
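
The function itself isn't shown here, but a minimal sketch of that kind of split could look like the following. It assumes the page separator is Tesseract's default form feed character, that each image contributes exactly one page, and that the pages appear in the same order as the image names passed in. The name `split_merged_text` is made up for illustration.

```python
from pathlib import Path

PAGE_SEPARATOR = "\f"  # assumed: Tesseract's default page separator (form feed)


def split_merged_text(merged_text, image_names, out_dir="."):
    """Write one <image name>.txt file per page of the merged OCR text."""
    pages = merged_text.split(PAGE_SEPARATOR)
    # The merged file normally ends with a separator, which leaves one
    # empty trailing chunk after the split.
    if pages and pages[-1].strip() == "":
        pages.pop()
    if len(pages) != len(image_names):
        raise ValueError(
            f"expected {len(image_names)} pages, found {len(pages)}"
        )
    written = []
    for name, page in zip(image_names, pages):
        # Image1.jpg -> Image1.jpg.txt, matching the naming in the question
        target = Path(out_dir) / f"{Path(name).name}.txt"
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written
```

It could be called as `split_merged_text(Path("outfile.txt").read_text(encoding="utf-8"), image_names)`, where `image_names` is the same ordered list that was given to tesseract. The length check is there so that a multi-page image or a changed separator fails loudly instead of silently misnaming files.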

Upvotes: 0
