Is there any way in OCR/tesseract/OpenCV for extracting text from a particular region of an image?

I’m setting up a new invoice extraction method using AI. I am able to recognize "Total"/"Company Details" in invoice images, but I need help extracting the text from a recognized region by specifying its area in the image (Xmin, Xmax, Ymin, Ymax).

Upvotes: 1

Answers (2)

Jay

Reputation: 2069

AWS recently launched a service called Textract that does exactly what you are trying to achieve.

Blog post + example: https://aws.amazon.com/blogs/machine-learning/automatically-extract-text-and-structured-data-from-documents-with-amazon-textract/

You can provide images and PDFs, and it extracts the text and returns it as structured objects. I haven't used the service yet, but plan to on the weekend.

Python example below:

import boto3

# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "simple-document-image.jpg"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

#print(response)

# Print each detected line of text in blue (ANSI escape codes)
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print('\033[94m' + item["Text"] + '\033[0m')
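
To limit the result to a particular region, one option is to filter the returned blocks by their bounding box. This is a minimal sketch, assuming the response follows Textract's layout in which each block carries `Geometry.BoundingBox` with `Left`, `Top`, `Width` and `Height` given as fractions of the page size. The function name `lines_in_region` and the sample data are made up for illustration and are not real Textract output:

```python
def lines_in_region(response, xmin, xmax, ymin, ymax):
    """Return the text of LINE blocks that lie fully inside the region.

    All coordinates are fractions of the page width/height (0.0 to 1.0),
    which is the convention assumed for Geometry.BoundingBox.
    """
    matches = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        left, top = box["Left"], box["Top"]
        right, bottom = left + box["Width"], top + box["Height"]
        # Keep the line only if its whole box is inside the region.
        if left >= xmin and right <= xmax and top >= ymin and bottom <= ymax:
            matches.append(block["Text"])
    return matches


# Hand-made response fragment for illustration only.
sample_response = {
    "Blocks": [
        {
            "BlockType": "LINE",
            "Text": "ACME Ltd.",
            "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.1, "Width": 0.4, "Height": 0.05}},
        },
        {
            "BlockType": "LINE",
            "Text": "Total: 42.00",
            "Geometry": {"BoundingBox": {"Left": 0.6, "Top": 0.85, "Width": 0.3, "Height": 0.05}},
        },
    ]
}

# Lines in the bottom-right part of the page.
print(lines_in_region(sample_response, 0.5, 1.0, 0.8, 1.0))
```

If your region is in pixels, divide Xmin/Xmax by the image width and Ymin/Ymax by the image height before calling the function.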

Upvotes: 2

TheExorcist

Reputation: 2004

It looks like you are new to this, so let me give you a quick walkthrough of the terms used in your question.

OCR (optical character recognition) is the general concept. Tesseract is a library dedicated to OCR. OpenCV is an image processing library that helps with object detection and recognition.

Yes, you can extract text from an image with the Tesseract library, provided the resolution is roughly 300 dpi or more. If the font is new or unknown to Tesseract, you should train the model on that font first.
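
If the scan is below that resolution, a common workaround is to upscale it before OCR. This is a rough sketch, assuming you know the scan's dpi and have the image loaded as an OpenCV array. The helper names `upscale_factor` and `upscale_for_ocr` are hypothetical, and cubic interpolation is just one reasonable choice:

```python
def upscale_factor(current_dpi, target_dpi=300):
    """Return the resize factor needed to reach target_dpi (never below 1.0)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


def upscale_for_ocr(image, current_dpi, target_dpi=300):
    """Resize an OpenCV image so its effective resolution is about target_dpi."""
    # Imported here so upscale_factor stays usable without OpenCV installed.
    import cv2

    factor = upscale_factor(current_dpi, target_dpi)
    if factor == 1.0:
        return image  # already at or above the target resolution
    return cv2.resize(image, None, fx=factor, fy=factor,
                      interpolation=cv2.INTER_CUBIC)
```

For example, a 150 dpi scan gives a factor of 2.0, so the image is doubled in both dimensions before it is handed to Tesseract.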

Also keep in mind that if you can box the text (crop the image to the text's bounding box) before calling Tesseract, it will work more accurately.

Terms such as box image and dpi may sound unfamiliar, but they are pivotal concepts for your work.

My suggestion: if you want to extract the digits from an image, go step by step.

1. Process the image to enhance its quality.
2. Detect the region that you want to extract.
3. Find the contour and its area.
4. Pass it to a box-image editor and tune the parameters.
5. Finally, give it to Tesseract.
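
When the region is already known as pixel coordinates (Xmin, Xmax, Ymin, Ymax), the steps above reduce to crop, clean up, and OCR. This is a minimal sketch, assuming the `opencv-python` and `pytesseract` packages plus a local Tesseract install. It skips the contour step, and the function names `clamp_region` and `ocr_region` are made up for illustration:

```python
def clamp_region(xmin, xmax, ymin, ymax, width, height):
    """Clamp a pixel region to the image bounds and return (x0, x1, y0, y1)."""
    x0, x1 = max(0, min(xmin, xmax)), min(width, max(xmin, xmax))
    y0, y1 = max(0, min(ymin, ymax)), min(height, max(ymin, ymax))
    if x0 >= x1 or y0 >= y1:
        raise ValueError("region lies outside the image")
    return x0, x1, y0, y1


def ocr_region(image_path, xmin, xmax, ymin, ymax):
    """Crop the given pixel region from the image and run Tesseract on it."""
    # Imported here so clamp_region stays usable without the OCR dependencies.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    height, width = image.shape[:2]
    x0, x1, y0, y1 = clamp_region(xmin, xmax, ymin, ymax, width, height)

    # OpenCV images are NumPy arrays indexed [row, column], i.e. [y, x].
    crop = image[y0:y1, x0:x1]

    # Enhance the crop: grayscale, then Otsu's automatic binarisation.
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # --psm 6 asks Tesseract to treat the crop as one uniform block of text.
    return pytesseract.image_to_string(binary, config="--psm 6").strip()
```

A call such as `ocr_region("invoice.jpg", 400, 780, 900, 960)` (file name and coordinates are placeholders) would return the text found in that rectangle. The thresholding and page segmentation mode are starting points to tune against your own invoices.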

Upvotes: 1
