Reputation: 31
When looping through PDF pages, Tesseract recognizes the text on one page correctly, producing output similar to:
Table 1 Summary Data 3
Table 2 Unique Data 5
but on another page
Table 3 Reservoir Data 8
Table 4 Surface Data 9
it drops the trailing numbers, so the output is similar to:
Table 3 Reservoir Data
Table 4 Surface Data
The numbers 8 and 9 aren't recognized. I checked the page images created by pdf2image:
pages = convert_from_path(pdf_path, 500)
and the far right text appears in the page image.
However, the DataFrame (df) in the code below contains no data from the far right of the page in question, nor any characters suggesting that recognition was attempted there. The PDF pages and images are of equal quality, and the far-right characters are in the same horizontal location on both pages.
This is the code I'm using:
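    import pandas as pd
    import pytesseract
    from pdf2image import convert_from_path
    from pytesseract import Output

    custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
    # pdfs is a list of PDF file paths, defined elsewhere
    for pdf_path in pdfs:
        # Render every page of the PDF to an image at 500 dpi
        pages = convert_from_path(pdf_path, 500)
        for pageNum, imgBlob in enumerate(pages):
            if pageNum < 8:
                if pageNum == 6:
                    # One row per detected element, with bounding box, confidence and text
                    d = pytesseract.image_to_data(imgBlob, config=custom_config, output_type=Output.DICT)
                    df = pd.DataFrame(d)
                    print(pageNum)
                    print(df)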
I wondered whether there is a horizontal limit or margin beyond which Tesseract cannot read, so I changed the DPI to 400 (I'm assuming the 500 is the DPI). I'm not finding anything related when googling terms like clipping, margins, or skipping.
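One way to confirm that nothing was detected on the right side of the page is to filter the image_to_data result by its left coordinate. The following is only a minimal sketch: the helper name words_right_of and the sample dictionary are hypothetical, and the sample shows just two of the keys that the real Output.DICT result contains.

```python
def words_right_of(data, min_left):
    """Return (left, text) pairs for non-empty words whose left edge is at or beyond min_left pixels."""
    hits = []
    for left, text in zip(data["left"], data["text"]):
        if text.strip() and left >= min_left:
            hits.append((left, text))
    return hits


# Hypothetical sample shaped like an Output.DICT result (only two keys shown).
sample = {
    "left": [120, 310, 560, 2900],
    "text": ["Table", "3", "", "8"],
}
right_side = words_right_of(sample, min_left=2500)
print(right_side)  # [(2900, '8')]
```

With the real data this could be called as words_right_of(d, int(imgBlob.width * 0.8)), assuming the far-right column sits in the last fifth of the page. An empty list for the problem page would confirm that no word boxes were produced in that region.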
Upvotes: 3
Views: 1556
Reputation: 11
It's a problem with the page segmentation mode. --psm 3 is not able to detect sparse characters in images. Use --psm 6, 11, or 12.
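To try those modes without retyping the whole config string, the --psm value can be swapped programmatically. This is a rough sketch; with_psm is a hypothetical helper, not part of pytesseract, and the base config is the one from the question.

```python
import re


def with_psm(config, psm):
    """Return config with its --psm value replaced, or with --psm appended if it is absent."""
    if re.search(r"--psm\s+\d+", config):
        return re.sub(r"--psm\s+\d+", f"--psm {psm}", config)
    return f"{config} --psm {psm}".strip()


base_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
# One config string per suggested page segmentation mode.
candidates = {psm: with_psm(base_config, psm) for psm in (6, 11, 12)}
for psm, config in candidates.items():
    print(psm, config)
```

Each resulting string can be passed as the config argument of pytesseract.image_to_data in place of custom_config.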
Upvotes: 1
Reputation: 19
I faced the same issue with Tesseract 4, and @K41F4r's solution worked for me with a page segmentation mode of 12 (Sparse text with OSD).
Upvotes: 0
Reputation: 1551
Check whether using a different page segmentation mode produces better results:
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 6 -l eng+ita'
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
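A quick way to compare modes on the problem page is to run OCR once per mode and count the non-empty words each one returns. The sketch below assumes the config options from the question. The names count_words_by_psm and fake_ocr are hypothetical, and fake_ocr is only a stand-in so the helper can be tried without Tesseract installed; it is not real OCR output.

```python
def count_words_by_psm(image, psm_modes, ocr=None):
    """Run OCR once per page segmentation mode and count the non-empty words found."""
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract installed.
        import pytesseract
        from pytesseract import Output

        def ocr(img, config):
            return pytesseract.image_to_data(img, config=config, output_type=Output.DICT)

    counts = {}
    for psm in psm_modes:
        config = f"-c preserve_interword_spaces=1 --oem 1 --psm {psm} -l eng+ita"
        data = ocr(image, config)
        counts[psm] = sum(1 for text in data["text"] if text.strip())
    return counts


# Stand-in OCR function (hypothetical): returns the same fixed words for every mode.
def fake_ocr(img, config):
    return {"text": ["Table", "3", "", "Data"]}


print(count_words_by_psm(None, [1, 6], ocr=fake_ocr))  # {1: 3, 6: 3}
```

With a real page image, count_words_by_psm(imgBlob, [1, 6, 11, 12]) would use pytesseract directly. A word count is a crude measure, so it is still worth reading the text of the mode that scores highest to check that the trailing numbers are present.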
Upvotes: 3