Yuzuriha Inori

Reputation: 165

Converting image identified by PyTesseract to an array

I have an image with a list of numbers which I have scanned using PyTesseract to construct a string. Concretely, here is the code:

from PIL import Image
import pytesseract
from scipy import stats
import numpy as np

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

str1=pytesseract.image_to_string(Image.open('D:/Image.png'))

Here's the image I am scanning:

Image

The problem is that PyTesseract is scanning the image as individual characters instead of integers.

I would like to understand why this is happening and what I can do to get the desired result.

In short, PyTesseract is not scanning integers in a list of numbers, instead scanning them as individual characters. How do I tell it to scan for integers and put them in an array?

Upvotes: 0

Views: 2068

Answers (1)

jizhihaoSAMA

Reputation: 12672

If you only want a list of integers, you can split the string with re.split and strip each piece. (The cleanup is needed because Tesseract's result has some errors.)

You can try this:

import pytesseract
import re

data = pytesseract.image_to_string('OCR.png')
dataList = re.split(r'[,.\s]+', data)  # split on commas, periods, and any whitespace (including newlines)
resultList = [int(i) for i in dataList if i]  # drop the empty strings and convert the rest to int
print(resultList)

# result: [71, 194, 38, 1701, 89, 76, 11, 83, 1629, 48, 94, 63, 132, 16, 111, 95, 84, 341, 975, 14, 40, 64, .......
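
As for why you see individual characters: image_to_string returns one plain string, so indexing or looping over it gives you characters, not numbers. A slightly more forgiving variant of the split above is to pull out every run of digits with re.findall, which ignores whatever separators or stray symbols the OCR produced. This is only a sketch: extract_ints and ocr_integers are hypothetical helper names, and the config string (page segmentation mode 6 plus a digit whitelist) is an assumption you should check against your Tesseract version, since whitelist support has varied between releases.

```python
import re


def extract_ints(ocr_text):
    """Return every run of digits in ocr_text as an int, in reading order."""
    return [int(token) for token in re.findall(r"\d+", ocr_text)]


def ocr_integers(image_path, config="--psm 6 -c tessedit_char_whitelist=0123456789,"):
    """Hypothetical wrapper: OCR the image, then keep only the integers.

    The config value is an assumption, not something taken from the question.
    """
    # Imported lazily so extract_ints can be used without Tesseract installed.
    import pytesseract

    text = pytesseract.image_to_string(image_path, config=config)
    return extract_ints(text)


# Stand-in for OCR output, with mixed separators and a line break.
sample = "71, 194. 38\n1701,89"
print(extract_ints(sample))  # [71, 194, 38, 1701, 89]
```

One caveat: if the OCR inserts a stray space or symbol inside a number (for example "17 01"), this approach returns two integers instead of one, so it is worth checking the result against the image. If you need a NumPy array instead of a list, passing the list to np.array should be enough.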

Upvotes: 2
