How to get confidence of each line using pytesseract

I have successfully set up Tesseract and can convert images to text:

text = pytesseract.image_to_string(Image.open(image))

However, I need the confidence value for every line, and I cannot find a way to get it with pytesseract. Does anyone know how to do this?

I know this is possible using PyTessBaseAPI, but I cannot use it: I have spent hours trying to set it up with no luck. I need a way to do this using pytesseract.

Upvotes: 10

Answers (3)

Srikar Appalaraju

Reputation: 73638

The currently accepted answer is not entirely correct. The correct way to group the words into lines with pytesseract (where text is the DataFrame returned by image_to_data) is:

text.groupby(['block_num','par_num','line_num'])['text'].apply(list)

This grouping follows from the answer to "Does anyone knows the meaning of output of image_to_data, image_to_osd methods of pytesseract?", which describes the columns as follows:

Column block_num: Block number of the detected text or item
Column par_num: Paragraph number of the detected text or item
Column line_num: Line number of the detected text or item
Column word_num: Word number of the detected text or item

All four columns are interconnected, however. When an item starts a new line, the word number restarts rather than continuing from the last word of the previous line. The same holds for line_num, par_num and block_num: each counter restarts whenever the unit that encloses it changes.
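The restart behaviour described above is why a single column is not a unique line identifier. A minimal sketch with hand-written rows (not real OCR output; the helper name group_words is made up for illustration) shows the difference between grouping by line_num alone and grouping by the full key:

```python
from collections import defaultdict

# Hand-written rows, not real OCR output:
# (block_num, par_num, line_num, text)
rows = [
    (1, 1, 1, "Ying"),
    (1, 1, 1, "Thai"),
    (1, 1, 2, "Seattle"),
    (2, 1, 1, "Total"),    # line_num is 1 again because a new block started
    (2, 1, 1, "$11.50"),
]

def group_words(rows, cols):
    """Group the text of each row by the values found at the given column indices."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in cols)
        groups[key].append(row[3])
    return dict(groups)

# Grouping by line_num alone merges lines that live in different blocks.
by_line_num_only = group_words(rows, (2,))

# Grouping by block_num, par_num and line_num keeps every line separate.
by_full_key = group_words(rows, (0, 1, 2))

print(by_line_num_only)
print(by_full_key)
```

With line_num alone, "Ying Thai" and "Total $11.50" end up in the same group; with the full key they stay apart.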

Upvotes: 2

Sandipan Dey

Reputation: 23099

@Srikar Appalaraju is right. Take the following example image:

Now use the following code:

text = pytesseract.image_to_data(gray, output_type='data.frame')
text = text[text.conf != -1]
text.head()

Notice that all five rows have the same block_num, so grouping by that column alone would put all five words (texts) together. That is not what we want: only the first three words belong to the first line. To group correctly (in a generic manner) for a large enough image, we need to group by all four columns page_num, block_num, par_num and line_num simultaneously in order to compute the confidence for the first line, as shown in the following code snippet:

keys = ['page_num', 'block_num', 'par_num', 'line_num']

# Join the words of each line and average their confidences
lines = text.groupby(keys)['text'].apply(lambda x: ' '.join(map(str, x))).tolist()
confs = text.groupby(keys)['conf'].mean().tolist()

# Keep only non-blank lines, each paired with its rounded mean confidence
line_conf = [(line, round(conf, 3)) for line, conf in zip(lines, confs) if line.strip()]

with the following desired output:

[('Ying Thai Kitchen', 91.667),
 ('2220 Queen Anne AVE N', 88.2),
 ('Seattle WA 98109', 90.333),
 ('« (206) 285-8424 Fax. (206) 285-8427', 83.167),
 ('‘uw .yingthaikitchen.com', 40.0),
 ('Welcome to Ying Thai Kitchen Restaurant,', 85.333),
 ('Order#:17 Table 2', 94.0),
 ('Date: 7/4/2013 7:28 PM', 86.25),
 ('Server: Jack (1.4)', 83.0),
 ('44 Ginger Lover $9.50', 89.0),
 ('[Pork] [24#]', 43.0),
 ('Brown Rice $2.00', 95.333),
 ('Total 2 iten(s) $11.50', 89.5),
 ('Sales Tax $1.09', 95.667),
 ('Grand Total $12.59', 95.0),
 ('Tip Guide', 95.0),
 ('TEK=$1.89, 18%=62.27, 20%=82.52', 6.667),
 ('Thank you very much,', 90.75),
 ('Cone back again', 92.667)]
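For readers who would rather not depend on pandas, the same per-line mean can be sketched over the dict-of-lists form that image_to_data returns with output_type=pytesseract.Output.DICT. The helper name line_confidences and the sample data below are made up for illustration, and unlike the snippet above this sketch also skips blank words before averaging:

```python
from collections import defaultdict

def line_confidences(data):
    """Return [(line_text, mean_conf), ...] from an image_to_data-style dict of lists."""
    words = defaultdict(list)
    confs = defaultdict(list)
    for i, raw_text in enumerate(data["text"]):
        conf = float(data["conf"][i])   # conf may arrive as a string or a number
        text = str(raw_text).strip()
        if conf < 0 or not text:        # -1 marks rows that are not words
            continue
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        words[key].append(text)
        confs[key].append(conf)
    return [(" ".join(words[k]), round(sum(confs[k]) / len(confs[k]), 3))
            for k in sorted(words)]

# Hand-made sample in the same column layout (not real OCR output).
sample = {
    "page_num":  [1, 1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num":   [1, 1, 1, 1, 1, 1],
    "line_num":  [0, 1, 1, 1, 2, 2],
    "conf":      ["-1", 90, 80, 70, -1, 88],
    "text":      ["", "Ying", "Thai", "Kitchen", "", "2220"],
}

result = line_confidences(sample)
print(result)
```

With a real image, the input would come from pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT) instead of the hand-made sample.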

Upvotes: 7

buydadip

Reputation: 9427

After much searching, I have figured out a way. Instead of image_to_string, one should use image_to_data. However, this will give you statistics for each word, not each line...

text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')

So I saved the result as a DataFrame and used pandas to group by block_num, since the OCR groups each line into a block. I also removed all rows with no confidence value (-1):

text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)

Using this same logic, you can also calculate the confidence per line by calculating the mean confidence of all words within the same block...

conf = text.groupby(['block_num'])['conf'].mean()

Upvotes: 27
