Reputation: 9427
I have successfully set up Tesseract and can convert images to text...
text = pytesseract.image_to_string(Image.open(image))
However, I need to get the confidence value for every line. I cannot find a way to do this using pytesseract. Anyone know how to do this?
I know this is possible using PyTessBaseAPI, but I cannot use it: I've spent hours trying to set it up with no luck, so I need a way to do this using pytesseract.
Upvotes: 10
Views: 17376
Reputation: 73638
The current accepted answer is not entirely correct. The correct way to get each line using pytesseract is:
text.groupby(['block_num','par_num','line_num'])['text'].apply(list)
We need to do this based on this answer: Does anyone knows the meaning of output of image_to_data, image_to_osd methods of pytesseract?
Note that all four columns (block_num, par_num, line_num and word_num) are interconnected. When an item starts a new line, the word number starts counting again; it does not continue from the last word number of the previous line. The same goes for line_num, par_num and block_num.
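To make the restart behaviour concrete, here is a minimal sketch in plain Python (no pandas or Tesseract required). The rows are made-up data laid out like the word-level output of image_to_data, and group_words is a hypothetical helper, not part of pytesseract. It shows that grouping on line_num alone merges lines from different blocks, while the full key keeps them apart.

```python
from collections import defaultdict

# Made-up word rows: (page_num, block_num, par_num, line_num, text).
# line_num restarts at 1 in the second block, as described above.
rows = [
    (1, 1, 1, 1, "Ying"),
    (1, 1, 1, 1, "Thai"),
    (1, 1, 1, 2, "Seattle"),
    (1, 2, 1, 1, "Grand"),
    (1, 2, 1, 1, "Total"),
]


def group_words(rows, key_indices):
    """Join the words of all rows that share the same key columns."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        groups[key].append(row[4])
    # Sort by key so the result order is deterministic.
    return [" ".join(words) for _, words in sorted(groups.items())]


# Grouping on line_num only: the first lines of both blocks collapse together.
by_line_only = group_words(rows, [3])

# Grouping on (page_num, block_num, par_num, line_num): one entry per real line.
by_full_key = group_words(rows, [0, 1, 2, 3])

print(by_line_only)
print(by_full_key)
```

The same reasoning applies to grouping on block_num alone: any block that holds more than one line ends up as a single group.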
Upvotes: 2
Reputation: 23099
@Srikar Appalaraju is right. Take the following example image:
Now use the following code:
text = pytesseract.image_to_data(gray, output_type='data.frame')
text = text[text.conf != -1]
text.head()
Notice that all five rows have the same block_num, so if we group by that column alone, all 5 words (texts) will be grouped together. That is not what we want: only the first 3 words belong to the first line. To group properly (in a generic manner) for a large enough image, we need to group by all 4 columns page_num, block_num, par_num and line_num simultaneously in order to compute the confidence for the first line, as shown in the following code snippet:
lines = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['text'] \
    .apply(lambda x: ' '.join(list(x))).tolist()
confs = text.groupby(['page_num', 'block_num', 'par_num', 'line_num'])['conf'].mean().tolist()
line_conf = []
for i in range(len(lines)):
    if lines[i].strip():
        line_conf.append((lines[i], round(confs[i], 3)))
with the following desired output:
[('Ying Thai Kitchen', 91.667),
('2220 Queen Anne AVE N', 88.2),
('Seattle WA 98109', 90.333),
('« (206) 285-8424 Fax. (206) 285-8427', 83.167),
('‘uw .yingthaikitchen.com', 40.0),
('Welcome to Ying Thai Kitchen Restaurant,', 85.333),
('Order#:17 Table 2', 94.0),
('Date: 7/4/2013 7:28 PM', 86.25),
('Server: Jack (1.4)', 83.0),
('44 Ginger Lover $9.50', 89.0),
('[Pork] [24#]', 43.0),
('Brown Rice $2.00', 95.333),
('Total 2 iten(s) $11.50', 89.5),
('Sales Tax $1.09', 95.667),
('Grand Total $12.59', 95.0),
('Tip Guide', 95.0),
('TEK=$1.89, 18%=62.27, 20%=82.52', 6.667),
('Thank you very much,', 90.75),
('Cone back again', 92.667)]
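If you would rather avoid pandas, the same grouping can be sketched in plain Python. This assumes the dict-of-lists layout that, as far as I know, pytesseract returns with output_type=pytesseract.Output.DICT (same column names as the data frame). line_confidences is a hypothetical helper and the sample dict is made up. One difference from the snippet above: words whose text is empty are dropped before averaging, rather than only skipping lines that end up blank.

```python
from collections import defaultdict


def line_confidences(data):
    """Return [(line_text, mean_conf)] from a dict of equal-length lists.

    Assumes the keys 'page_num', 'block_num', 'par_num', 'line_num',
    'conf' and 'text', mirroring the image_to_data columns.
    """
    words = defaultdict(list)
    confs = defaultdict(list)
    for i in range(len(data["text"])):
        # conf may arrive as a string or a number depending on the version.
        conf = float(data["conf"][i])
        text = str(data["text"][i]).strip()
        # -1 marks non-word rows (page, block, paragraph, line); skip them.
        if conf < 0 or not text:
            continue
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        words[key].append(text)
        confs[key].append(conf)
    return [(" ".join(words[k]), round(sum(confs[k]) / len(confs[k]), 3))
            for k in sorted(words)]


# Made-up sample in the assumed layout: one structural row, then two lines.
sample = {
    "page_num": [1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 2, 2],
    "conf": ["-1", 90, 95, 80.0, 85.0],
    "text": ["", "Ying", "Thai", "Seattle", "WA"],
}

result = line_confidences(sample)
print(result)
```

With a real image you would pass the result of pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT) instead of sample.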
Upvotes: 7
Reputation: 9427
After much searching, I have figured out a way. Instead of image_to_string, one should use image_to_data. However, this will give you statistics for each word, not each line...
text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')
So what I did was save it as a dataframe and then use pandas to group by block_num, as the OCR groups each line into blocks. I also removed all rows with no confidence value (-1)...
text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)
Using this same logic, you can also calculate the confidence per line by calculating the mean confidence of all words within the same block...
conf = text.groupby(['block_num'])['conf'].mean()
Upvotes: 27