buydadip
buydadip

Reputation: 9427

How to get confidence of each line using pytesseract

I have successfully setup Tesseract and can translate the images to text...

text = pytesseract.image_to_string(Image.open(image))

However, I need to get the confidence value for every line. I cannot find a way to do this using pytesseract. Anyone know how to do this?

I know this is possible using PyTessBaseAPI, but I cannot use that, I've spent hours attempting to set it up with no luck, so I need a way to do this using pytesseract.

Upvotes: 10

Views: 17376

Answers (3)

Srikar Appalaraju
Srikar Appalaraju

Reputation: 73638

The current accepted answer is not entirely correct. The correct way to get each line using pytesseract is

text.groupby(['block_num','par_num','line_num'])['text'].apply(list)

We need to do this based on this answer: Does anyone knows the meaning of output of image_to_data, image_to_osd methods of pytesseract?

  1. Column block_num: Block number of the detected text or item
  2. Column par_num: Paragraph number of the detected text or item
  3. Column line_num: Line number of the detected text or item
  4. Column word_num: word number of the detected text or item

But above all 4 columns are interconnected.If the item comes from new line then word number will start counting again from 0, it doesn't continue from previous line last word number. Same goes with line_num, par_num, block_num.

Upvotes: 2

Sandipan Dey
Sandipan Dey

Reputation: 23099

@Srikar Appalaraju is right. Take the following example image:

enter image description here

Now use the following code:

text = pytesseract.image_to_data(gray, output_type='data.frame')
text = text[text.conf != -1]
text.head()

enter image description here

Notice that all five rows have the same block_num, so that if we group by using that column, all the 5 words (texts) will be grouped together. But that's not what we want, we want to group only the first 3 words that belong to the first line and in order to do that properly (in a generic manner) for a large enough image we need to group by all the 4 columns page_num, block_num, par_num and line_num simulataneuosly, in order to compute the confidence for the first line, as shown in the following code snippet:

lines = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['text'] \
                                     .apply(lambda x: ' '.join(list(x))).tolist()
confs = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['conf'].mean().tolist()
    
line_conf = []
    
for i in range(len(lines)):
    if lines[i].strip():
        line_conf.append((lines[i], round(confs[i],3)))

with the following desired output:

[('Ying Thai Kitchen', 91.667),
 ('2220 Queen Anne AVE N', 88.2),
 ('Seattle WA 98109', 90.333),
 ('« (206) 285-8424 Fax. (206) 285-8427', 83.167),
 ('‘uw .yingthaikitchen.com', 40.0),
 ('Welcome to Ying Thai Kitchen Restaurant,', 85.333),
 ('Order#:17 Table 2', 94.0),
 ('Date: 7/4/2013 7:28 PM', 86.25),
 ('Server: Jack (1.4)', 83.0),
 ('44 Ginger Lover $9.50', 89.0),
 ('[Pork] [24#]', 43.0),
 ('Brown Rice $2.00', 95.333),
 ('Total 2 iten(s) $11.50', 89.5),
 ('Sales Tax $1.09', 95.667),
 ('Grand Total $12.59', 95.0),
 ('Tip Guide', 95.0),
 ('TEK=$1.89, 18%=62.27, 20%=82.52', 6.667),
 ('Thank you very much,', 90.75),
 ('Cone back again', 92.667)]

Upvotes: 7

buydadip
buydadip

Reputation: 9427

After much searching, I have figured out a way. Instead of image_to_string, one should use image_to_data. However, this will give you statistics for each word, not each line...

text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')

So what I did was saved it as a dataframe, and then used pandas to group by block_num, as each line is grouped into blocks using OCR, I also removed all rows with no confidence values (-1)...

text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)

Using this same logic, you can also calculate the confidence per line by calculating the mean confidence of all words within the same block...

conf = text.groupby(['block_num'])['conf'].mean()

Upvotes: 27

Related Questions