Tazo

Reputation: 31

tesseract not picking up characters on right side of page

When looping through the pages of a PDF, Tesseract recognizes the characters on one page correctly, giving output similar to:

Table 1 Summary Data                    3
Table 2 Unique  Data                    5

but on another page

Table 3  Reservoir Data                 8
Table 4  Surface Data                   9

it drops the last numbers so the output is similar to

Table 3  Reservoir Data                
Table 4  Surface Data  

The numbers 8 and 9 aren't interpreted. I checked the images created from pdf2image

pages = convert_from_path(pdf_path, 500)

and the far right text appears in the page image.

However, the dataframe (df) in the code below contains no far-right data for the page in question, nor any characters suggesting that recognition was even attempted there. The PDF pages and images are of equal quality, and the far-right characters are in the same horizontal location on both pages.

This is the code I'm using:

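    import pandas as pd
    import pytesseract
    from pdf2image import convert_from_path
    from pytesseract import Output

    custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
    # pdfs is a list of PDF file paths defined elsewhere
    for pdf_path in pdfs:
        # The second positional argument of convert_from_path is the dpi
        pages = convert_from_path(pdf_path, 500)

        for pageNum, imgBlob in enumerate(pages):
            # Only inspect the page where the far-right numbers go missing
            if pageNum == 6:
                d = pytesseract.image_to_data(imgBlob, config=custom_config, output_type=Output.DICT)
                df = pd.DataFrame(d)

                print(pageNum)
                print(df)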

I wondered whether there is a horizontal limit or margin beyond which Tesseract cannot read, so I changed the dpi to 400 (I'm assuming the 500 is the dpi). Googling terms like clipping, margins, or skipping hasn't turned up anything related.

Upvotes: 3

Views: 1556

Answers (3)

Nikhil Fande

Reputation: 11

It's a problem with the page segmentation mode. The automatic modes (--psm 1 as used in the question, or the default --psm 3) may fail to detect sparse characters in images. Use --psm 6, 11, or 12 instead.

Upvotes: 1

Jamsheer Moideen

Reputation: 19

I faced the same issue with Tesseract 4, and @K41F4r's solution worked for me with a page segmentation mode of 12 (Sparse text with OSD).
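
One thing to watch for with the sparse modes: per the mode description they return text "in no particular order", so the table rows may need to be rebuilt from the word coordinates. Below is a rough sketch of one way to do that from the dict returned by `image_to_data`. The function name `group_words_into_rows` and the 20 pixel `tolerance` are my own assumptions, not anything from pytesseract, and the tolerance would need tuning for your dpi and font size.

```python
def group_words_into_rows(data, tolerance=20):
    """Rebuild text rows from an image_to_data dict.

    Words whose 'top' value lies within `tolerance` pixels of the first
    word in a row are treated as the same row (an assumed heuristic).
    Each row is returned as a string with its words ordered left to right.
    """
    # Keep only entries that contain visible text
    words = [
        (int(top), int(left), str(text).strip())
        for text, top, left in zip(data["text"], data["top"], data["left"])
        if str(text).strip()
    ]
    words.sort()  # top to bottom, then left to right

    rows = []
    for top, left, text in words:
        if rows and abs(top - rows[-1]["top"]) <= tolerance:
            rows[-1]["words"].append((left, text))
        else:
            rows.append({"top": top, "words": [(left, text)]})

    return [" ".join(text for _, text in sorted(row["words"])) for row in rows]
```

With the question's code this would be called as `group_words_into_rows(d)`, where `d` is the `Output.DICT` result for the page.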

Upvotes: 0

K41F4r

Reputation: 1551

Check whether a different page segmentation mode produces better results, for example:

    custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 6 -l eng+ita'

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
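
To see which mode actually reaches the right-hand column, something like the sketch below could help. It runs `image_to_data` once per mode and reports the right edge (left + width, in pixels) of the rightmost recognized word, which can then be compared with the image width. The helper names `build_config`, `rightmost_text_edge`, and `compare_psm_modes` are made up for this example, and the modes tried are just the ones mentioned in this thread; modes 1 and 12 use OSD, so they assume the `osd` traineddata is installed.

```python
def build_config(psm, langs="eng+ita"):
    """Return the question's Tesseract config with a different --psm value."""
    return f"-c preserve_interword_spaces=1 --oem 1 --psm {psm} -l {langs}"


def rightmost_text_edge(data):
    """Largest left + width over non-empty words in an image_to_data dict.

    Returns None when no text was recognized at all.
    """
    edges = [
        int(left) + int(width)
        for text, left, width in zip(data["text"], data["left"], data["width"])
        if str(text).strip()
    ]
    return max(edges) if edges else None


def compare_psm_modes(image, modes=(1, 6, 11, 12)):
    """Run OCR once per mode and map each mode to its rightmost text edge."""
    # Imported lazily so the two helpers above can be used without pytesseract
    import pytesseract
    from pytesseract import Output

    results = {}
    for psm in modes:
        data = pytesseract.image_to_data(
            image, config=build_config(psm), output_type=Output.DICT
        )
        results[psm] = rightmost_text_edge(data)
    return results
```

For the page in the question this would be called as `compare_psm_modes(pages[6])`. A mode whose value is far below `pages[6].width` is probably dropping the page-number column.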

Upvotes: 3
