Reputation: 21
I am trying to extract text with AWS Textract using the boto3 package in Python. I found a way to read a two-column document in reading order, and I would like to know whether three-column documents can be handled the same way. The code for the two-column case is given below and comes from https://github.com/aws-samples/amazon-textract-code-samples/blob/master/python/03-reading-order.py
    import boto3

    # Document
    documentName = "two-column-image.jpg"

    # Amazon Textract client
    textract = boto3.client('textract')

    # Call Amazon Textract
    with open(documentName, "rb") as document:
        response = textract.detect_document_text(
            Document={
                'Bytes': document.read(),
            }
        )

    #print(response)

    # Detect columns and print lines
    columns = []
    lines = []
    for item in response["Blocks"]:
        if item["BlockType"] == "LINE":
            column_found = False
            for index, column in enumerate(columns):
                bbox_left = item["Geometry"]["BoundingBox"]["Left"]
                bbox_right = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]
                bbox_centre = item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]/2
                # Midpoint of the column's horizontal extent
                column_centre = column['left'] + (column['right'] - column['left'])/2
                if (bbox_centre > column['left'] and bbox_centre < column['right']) or (column_centre > bbox_left and column_centre < bbox_right):
                    # Bbox appears inside the column
                    lines.append([index, item["Text"]])
                    column_found = True
                    break
            if not column_found:
                # No existing column matched, so this line starts a new column
                columns.append({'left': item["Geometry"]["BoundingBox"]["Left"], 'right': item["Geometry"]["BoundingBox"]["Left"] + item["Geometry"]["BoundingBox"]["Width"]})
                lines.append([len(columns)-1, item["Text"]])

    # Group lines by column index (the sort is stable, so order within a column is kept)
    lines.sort(key=lambda x: x[0])
    for line in lines:
        print(line[1])
I browsed through the Textract user manual but could not figure it out.
Upvotes: 2
Views: 1011
Reputation: 1
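I don't think you need to change the code. This program should work on documents with more than two columns. If you read the code you provided, you will see that it works in a loop: each detected line is compared against every column found so far, and a new column is added whenever a line does not fit any of them, so the number of columns is not fixed at two.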
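A minimal sketch to check this without calling AWS: the same column-grouping loop wrapped in a function and run on hand-made blocks laid out in three columns. The function name `lines_in_column_order`, the helper `fake_line`, and the coordinates are made up for illustration; only the `BlockType`, `Text`, and `Geometry.BoundingBox` fields used in the question are assumed.

```python
def lines_in_column_order(blocks):
    """Group Textract-style LINE blocks into columns, then return text column by column."""
    columns = []  # each entry: {'left': float, 'right': float}
    lines = []    # each entry: [column_index, text]
    for item in blocks:
        if item["BlockType"] != "LINE":
            continue
        box = item["Geometry"]["BoundingBox"]
        bbox_left = box["Left"]
        bbox_right = box["Left"] + box["Width"]
        bbox_centre = box["Left"] + box["Width"] / 2
        for index, column in enumerate(columns):
            column_centre = (column["left"] + column["right"]) / 2
            if (column["left"] < bbox_centre < column["right"]
                    or bbox_left < column_centre < bbox_right):
                lines.append([index, item["Text"]])
                break
        else:
            # No existing column matched, so this line starts a new one
            columns.append({"left": bbox_left, "right": bbox_right})
            lines.append([len(columns) - 1, item["Text"]])
    # Stable sort: keeps the original order inside each column
    lines.sort(key=lambda x: x[0])
    return [text for _, text in lines]


def fake_line(text, left, width=0.25):
    """Hypothetical helper: a minimal LINE block with only the fields used above."""
    return {"BlockType": "LINE", "Text": text,
            "Geometry": {"BoundingBox": {"Left": left, "Width": width}}}


# Three columns, listed row by row (an assumed input order)
blocks = [
    fake_line("A1", 0.05), fake_line("B1", 0.37), fake_line("C1", 0.70),
    fake_line("A2", 0.05), fake_line("B2", 0.37), fake_line("C2", 0.70),
]
print(lines_in_column_order(blocks))  # ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
```

Note that columns are numbered in the order they are first seen, not strictly left to right, and a full-width heading would register as one wide column that absorbs the lines below it, so real pages may need extra handling.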
Upvotes: 0