Shady

Reputation: 21

Extracting text data by using boto3 python package (AWS Textract)

I am trying to extract text with AWS Textract using the boto3 package in Python. I found a way to extract text from a two-column document, and I would like to know whether I can also handle three-column documents. The code for the two-column format is given below (from https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/03-reading-order.py):

import boto3

# Document
documentName = "two-column-image.jpg"

# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
with open(documentName, "rb") as document:
    response = textract.detect_document_text(
        Document={
            'Bytes': document.read(),
        }
    )

#print(response)

# Detect columns and print lines
columns = []
lines = []
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        column_found=False
        for index, column in enumerate(columns):
            bbox_left = item["Geometry"]["BoundingBox"]["Left"]
            bbox_right = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]
            bbox_centre = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]/2
            column_centre = (column['left'] + column['right'])/2

            if (bbox_centre > column['left'] and bbox_centre < column['right']) or (column_centre > bbox_left and column_centre < bbox_right):
                #Bbox appears inside the column
                lines.append([index, item["Text"]])
                column_found=True
                break
        if not column_found:
            columns.append({'left':item["Geometry"]["BoundingBox"]["Left"], 'right':item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]})
            lines.append([len(columns)-1, item["Text"]])

lines.sort(key=lambda x: x[0])
for line in lines:
    print(line[1])

I have looked through the Textract documentation, but could not figure it out:

https://boto3.amazonaws.com/v1/documentation/api/1.12.5/reference/services/textract.html#Textract.Client.analyze_document

Upvotes: 2

Views: 1011

Answers (1)

Zhe XU

Reputation: 1

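I don't think you need to change the code. It should already work on documents with more than two columns. If you read the code you provided, you will see that it works in a loop: each line is compared against every column found so far, and a new column is added whenever the line does not fit any of them, so the number of columns is not fixed at two.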
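
To check this without calling AWS, here is a minimal sketch that runs the same grouping logic on hand-made blocks laid out in three columns. The function name `group_lines_by_column`, the helper `make_line`, and the sample coordinates are all made up for illustration; only the `BlockType`, `Text`, and `Geometry` / `BoundingBox` keys follow the shape of the response used in the question. The column centre is computed here as the midpoint of the left and right edges.

```python
def group_lines_by_column(blocks):
    """Return LINE texts column by column, in order of each column's first appearance."""
    columns = []  # each entry: {"left": float, "right": float}
    lines = []    # each entry: (column_index, text)
    for item in blocks:
        if item["BlockType"] != "LINE":
            continue
        box = item["Geometry"]["BoundingBox"]
        bbox_left = box["Left"]
        bbox_right = box["Left"] + box["Width"]
        bbox_centre = box["Left"] + box["Width"] / 2
        for index, column in enumerate(columns):
            # Midpoint of the column's horizontal extent
            column_centre = (column["left"] + column["right"]) / 2
            if (column["left"] < bbox_centre < column["right"]
                    or bbox_left < column_centre < bbox_right):
                lines.append((index, item["Text"]))
                break
        else:
            # No existing column matched, so this line starts a new one
            columns.append({"left": bbox_left, "right": bbox_right})
            lines.append((len(columns) - 1, item["Text"]))
    # list.sort is stable, so lines keep their original order within a column
    lines.sort(key=lambda entry: entry[0])
    return [text for _, text in lines]


def make_line(text, left, width):
    """Hypothetical helper: build a dict shaped like a Textract LINE block."""
    return {
        "BlockType": "LINE",
        "Text": text,
        "Geometry": {"BoundingBox": {"Left": left, "Width": width}},
    }


# Made-up page with three columns, listed row by row (A1 B1 C1, then A2 B2 C2)
sample_blocks = [
    {"BlockType": "PAGE"},
    make_line("A1", 0.05, 0.25),
    make_line("B1", 0.37, 0.25),
    make_line("C1", 0.70, 0.25),
    make_line("A2", 0.05, 0.20),
    make_line("B2", 0.37, 0.22),
    make_line("C2", 0.70, 0.24),
]

ordered = group_lines_by_column(sample_blocks)
print(ordered)
```

With this sample the lines come out grouped by column (A1, A2, B1, B2, C1, C2). Note that columns are numbered in the order they are first seen, not strictly left to right, so a real page whose first detected line sits in the middle column would print that column first.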

Upvotes: 0
