Dave

Reputation: 883

Why is pytesseract.image_to_string not preserving interword spaces?

Using Tesseract

PS C:\Program Files\Tesseract-OCR> .\tesseract --version
tesseract v5.3.0.20221222
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
 Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

I have tested Tesseract successfully on the command line:

PS C:\Program Files\Tesseract-OCR> .\tesseract C:\ocr\target\31832_226140__0001-00002b.jpg C:\ocr\results\31832_226140__0001-00002bb6523dpi300fullest --dpi 300 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. '

Partial output

269 Wellington Road Wainumomats Marned       101 ARNOLD. Frank Witham ...............................15 Rossiter Avenue.Lower Hutt. Butcher
002 ANKER. Doreen Akson .............................4 Bledisioe Crescent. Wamuiomata. Teacher       102 ARONA. Amosa ...............0000...........3 Donnelley Drve.Wasnuiomata.Pub. Servant
004 ANKER. Robert James ..........................269 Wellington Road.WainuiomataBank Off       104 ARPS. Velde Lucia ................ ..........53 Westminster Road Wamnuomata Resch Intvr
005 ANNESLEV. Boyne Evan .............................. 13 Manurewa GroveWainwomata Clerk       105 ARPS. Wilkem David ..........................53 Westmnster Road. Waimuomata.Foreman
006 ANNESLEY. Janet Maree ....................13 Manurewa Grove Wainuomats Housewite       106 ARROWSMITH. Margaret Bessie .... ... . 4 Isabel Grove. Wainuiomata. Mamed
007 ANSELL. Anme Ena Elizabeth .........................3 Lewghton Av. Lower Hutt. Homemaker       107 ARROWSMITH. Morns Anthony ................ . 4 Isabel Gr Wamuomata Fetry Magr
O08 ANGELL. Eb se by oe ceeseceereeess 76 Bell Road. Lower Hutt. Housewrfe

I need to process hundreds of files so I downloaded and installed pytesseract.

Successfully installed pytesseract-0.3.10

I upgraded pip

Successfully installed pip-23.0.1

I have run tox

PS C:\Program Files\Tesseract-OCR> tox
ROOT: No tox.ini or setup.cfg or pyproject.toml found, assuming empty tox.ini at C:\Program Files\Tesseract-OCR
  py: OK (4.34 seconds)
  congratulations :) (4.67 seconds)

However, when I run the following Python script, which points at the same tesseract.exe, interword spacing is not preserved.

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
image = 'C:\\ocr\\target\\31832_226140__0001-00002b.jpg'
target = print(pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. '))

Partial output

.269WellngtonRoadWainumomatsMarned       101ARNOLD.FrankWitham...............................15RossiterAvenue.LowerHutt.Butcher       
002ANKER.DoreenAkson.............................4BledisioeCrescent.Wamuiomata.Teacher       102ARONA.Amosa...............0000...........3DonnelleyDrve.Wasnuiomata.Pub.Servant      
004ANKER.RobertJames..........................269WellingtonRoad.WainuiomataBankOff       104ARPS.ValdaLucis..........................53WestminsterRoadWamnuomataReschIntvr
005ANNESLEV.BoyneEvan..............................13ManurewaGroveWainwomataClerk       105ARPS.WilkemDavid..........................53WestmnsterRoad.Waimuomata.Foreman
006ANNESLEY.JanotMaree....................13ManurewaGroveWainuomatsHousewite       106ARROWSMITH.MargaretBessie........4IsabelGrove.Wainuiomata.Mamed
007ANSELL.AnmeEnaElizabeth.........................3LewghtonAv.LowerHutt.Homemaker       107ARROWSMITH.MornsAnthony.................4IsabelGrWamuomata.FetryMagr
O008ANMGELL.Ebsebyyceeseceereeess76BellRoad.LowerHutt.Housewrfe       108ARTHUR.BruceJames....................65MoohanStreet.WainuomataApp.Mouider 

Can anyone see why this pytesseract image_to_string call is not honouring the config parameter preserve_interword_spaces=1 the way the tesseract command line example does?

Upvotes: 3

Views: 1246

Answers (2)

Mikhail Ilin

Reputation: 195

I had an issue with whitespace not being taken into account by the tessedit_char_whitelist config option on Windows with pytesseract==0.3.13.

config = "--psm 6 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "

If I didn't use quotes, as in the example above, the whitespace was not taken into account.

config = "--psm 6 -c tessedit_char_whitelist='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 '"

If I used quotes, as in my last example or in the other solution in this thread, I received "ValueError: No closing quotation" from the shlex module.

It turns out that pytesseract uses shlex to parse the config string, and it does so differently on Windows than on other systems:

# snippet from pytesseract.py > run_tesseract method
cmd_args = []
not_windows = not (sys.platform == 'win32')
if not_windows and nice != 0:
    cmd_args += ('nice', '-n', str(nice))

cmd_args += (tesseract_cmd, input_filename, output_filename_base)

if lang is not None:
    cmd_args += ('-l', lang)

if config:
    cmd_args += shlex.split(config, posix=not_windows)

If I hardcode not_windows = True, the second config (with quotes) is executed properly by pytesseract and I get the expected results (no error about quotes, whitespace taken into account).
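
The difference can be reproduced with shlex alone, without Tesseract installed. This is a minimal sketch that uses a shortened whitelist (abc123 plus a trailing space) purely for illustration; the variable names are my own and not part of pytesseract.

```python
import shlex

# Shortened whitelist for illustration; the trailing space is the part that matters.
unquoted = "--psm 6 -c tessedit_char_whitelist=abc123 "
quoted = "--psm 6 -c tessedit_char_whitelist='abc123 '"

# Without quotes the trailing space is treated as a separator and dropped,
# so the whitelist that reaches Tesseract contains no space character.
unquoted_tokens = shlex.split(unquoted, posix=False)

# POSIX mode: the quotes group the value and are removed, so the space survives.
posix_tokens = shlex.split(quoted, posix=True)

# Non-POSIX mode: a quote in the middle of a word is not treated as an opening
# quote, so the word ends at the space and the final quote is left unmatched.
try:
    shlex.split(quoted, posix=False)
    non_posix_error = None
except ValueError as exc:
    non_posix_error = str(exc)

print(repr(unquoted_tokens[-1]))
print(repr(posix_tokens[-1]))
print(non_posix_error)
```

The last token in POSIX mode keeps the trailing space, while the non-POSIX call raises the same ValueError I was seeing.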

This is clearly a "quick and dirty" solution, so any comments on how to deal with it properly are welcome!
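
One possible alternative that avoids patching pytesseract is to bypass the config-string parsing and call tesseract.exe directly with an argument list, so no shell-style quoting is involved. The sketch below is an assumption-laden example rather than a tested fix: build_command and ocr_folder are hypothetical helper names, the executable path is the one from the question, and the flags are the ones that worked on the command line there.

```python
import subprocess
from pathlib import Path

# Assumed install location, taken from the question; adjust as needed.
TESSERACT_EXE = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# The whitelist ends with a space so that spaces can appear in the output.
WHITELIST = (
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789. "
)


def build_command(image_path, output_base, dpi=300, psm=6):
    """Return the tesseract argument list for one image (hypothetical helper)."""
    return [
        TESSERACT_EXE, str(image_path), str(output_base),
        "--dpi", str(dpi), "--psm", str(psm),
        "-c", "preserve_interword_spaces=1",
        # Passed as a single list element, so the trailing space is kept intact.
        "-c", "tessedit_char_whitelist=" + WHITELIST,
    ]


def ocr_folder(source_dir, results_dir):
    """Run tesseract on every .jpg in source_dir (hypothetical helper).

    Tesseract appends .txt to the output base it is given.
    """
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    for image_path in sorted(Path(source_dir).glob("*.jpg")):
        cmd = build_command(image_path, results_dir / image_path.stem)
        subprocess.run(cmd, check=True)
```

Calling ocr_folder(r"C:\ocr\target", r"C:\ocr\results") would then process a whole directory, at the cost of reading the text back from the result files instead of getting a string from image_to_string.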

Upvotes: 1

Dave

Reputation: 883

The answer is to make sure that the space character is NOT omitted from the whitelist. Omitting it effectively removes spaces from the output, which makes it look as though the preserve_interword_spaces=1 parameter is not functioning. In the question's config the trailing space was not quoted, so it never became part of the whitelist.

For reference, the correct command should have been:

print(pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. "'))

The use of single and double quotes is important: the single quotes surround the complete config string, and the double quotes surround the literal whitelist, including its trailing space.

It would seem from this that the whitelist has precedence over the preserve_interword_spaces parameter. The preserve_interword_spaces parameter may be redundant if you are including a space in your whitelist.

Upvotes: 1
