Riccardo


Extracting data from a table with known labels using Tesseract

I am trying to use Tesseract to create a small Windows application that allows the user to:

The app works, but data extraction is still error-prone. Sometimes a value is not extracted at all because its label is not recognized correctly. Other times the label is recognized and the value is extracted, but the number is wrong. I also noticed that the error rate is higher on my work PC, probably because its screen resolution (and therefore the screenshot resolution) is lower than on my home PC.

I am wondering if there is a more reliable way to accomplish my goal.

Below I have attached some images of the app to give you an idea, an example of the table, and the Python script I am using for OCR.

Thank you very much for your help!!!

tesseract v5.4.0.20240606

Python 3.13.1

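import cv2
import pytesseract


def preprocess_image(image_path):
    """Remove table grid lines and return the text on a white background."""
    image = cv2.imread(image_path)
    if image is None:
        # cv2.imread returns None instead of raising when the file cannot be read.
        raise FileNotFoundError(f"Could not read image: {image_path}")

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Inverted Otsu threshold: dark pixels (text and grid lines) become white.
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # Detect long horizontal runs (grid lines) and paint them black in the mask.
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50, 1))
    detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
    # Index -2 selects the contour list in both OpenCV 3 and OpenCV 4.
    cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    for c in cnts:
        cv2.drawContours(thresh, [c], -1, (0, 0, 0), 2)

    # Same for vertical grid lines, drawn thicker to cover their full width.
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 15))
    detect_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=3)
    cnts = cv2.findContours(detect_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    for c in cnts:
        cv2.drawContours(thresh, [c], -1, (0, 0, 0), 5)

    # Keep only the pixels under the mask (the text) from the original image.
    result = cv2.bitwise_and(image, image, mask=thresh)
    result[thresh == 0] = (255, 255, 255)  # Set background to white

    return result


def extract_text(image):
    """Extract text from the processed image using Tesseract."""
    # --psm 6 tells Tesseract to assume a single uniform block of text.
    return pytesseract.image_to_string(image, lang="eng", config="--psm 6")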
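
Since the labels are known in advance, the text returned by extract_text could be post-processed so that a slightly misread label still matches and common digit confusions are repaired. The following is only a minimal sketch under several assumptions, not a tested solution for this table: the label names in KNOWN_LABELS are hypothetical placeholders, each table row is assumed to come out of OCR as one line with the value as its last whitespace-separated token, a comma is assumed to be a decimal separator, and the helper names match_label, parse_number and parse_table are made up for illustration. It uses only the standard library (difflib and re).

```python
import difflib
import re

# Hypothetical label set; replace with the labels of the real table.
KNOWN_LABELS = ["Total weight", "Net weight", "Temperature", "Pressure"]

# Letters that OCR commonly confuses with digits in numeric fields.
DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def match_label(text, labels=KNOWN_LABELS, cutoff=0.7):
    """Return the known label closest to `text`, or None if none is close enough."""
    lowered = {label.lower(): label for label in labels}
    hits = difflib.get_close_matches(
        text.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


def parse_number(text):
    """Repair common digit confusions, then parse the first number found."""
    # Assumption: a comma is a decimal separator, never a thousands separator.
    cleaned = text.translate(DIGIT_FIXES).replace(",", ".")
    found = re.search(r"-?\d+(?:\.\d+)?", cleaned)
    return float(found.group()) if found else None


def parse_table(ocr_text, labels=KNOWN_LABELS):
    """Map each recognised label to the numeric value on the same line."""
    values = {}
    for line in ocr_text.splitlines():
        # Assumption: the value is the last whitespace-separated token.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        label = match_label(parts[0], labels)
        number = parse_number(parts[1])
        if label is not None and number is not None:
            values[label] = number
    return values


if __name__ == "__main__":
    sample = "Tota1 weight 12.5\nNet welght 1O.O\nPressure 3"
    print(parse_table(sample))
```

The cutoff of 0.7 is an arbitrary starting point: a lower value tolerates worse OCR but raises the risk of matching the wrong label when two labels are similar. Labels missing from the result could then be reported back to the user instead of being silently dropped.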

[Images: screenshots of the app and an example of the table]

