SteveS

Reputation: 4040

How to analyse PDF documents with Amazon Textract synchronously?

I want to extract tables from a batch of PDFs. To do this I am using an AWS Textract pipeline written in Python.

How can I do this without SNS and SQS? I want it to be synchronous: give my pipeline a PDF file, call AWS Textract and get the results.

Here is what I use at the moment; please advise what I should change:

import boto3
import time

def startJob(s3BucketName, objectName):
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })

    return response["JobId"]

def isJobComplete(jobId):
    # For production use cases, use SNS based notification 
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))

    while(status == "IN_PROGRESS"):
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))

    return status

def getJobResults(jobId):

    pages = []

    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)

    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']

    while(nextToken):

        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)

        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = None
        if('NextToken' in response):
            nextToken = response['NextToken']

    return pages

# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "Amazon-Textract-Pdf.pdf"

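jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))

# isJobComplete returns the final status string, which is always truthy,
# so compare it explicitly before fetching results.
if isJobComplete(jobId) == "SUCCEEDED":
    response = getJobResults(jobId)

    # Print detected text
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                print('\033[94m' + item["Text"] + '\033[0m')
else:
    print("Textract job did not succeed")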

Upvotes: 2

Views: 6280

Answers (3)

dychen

Reputation: 1

Just as an update to @Paradigm's answer, there is now support for PDFs in Textract.

The document image can be in either PNG, JPEG, PDF, or TIFF format. Results for synchronous operations are returned immediately and are not stored for retrieval.

source: https://docs.aws.amazon.com/textract/latest/dg/sync-calling.html
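Since the question is about tables, here is a minimal sketch of a synchronous call that asks for tables. It assumes a single-page PDF (the synchronous operations accept only one page) passed as bytes, and it uses analyze_document with FeatureTypes=["TABLES"] rather than plain text detection. The helper names extract_tables and analyze_pdf_tables are made up for illustration, and the client is passed in so the parsing can be tried without AWS credentials.

```python
def extract_tables(blocks):
    """Rebuild every TABLE block as a list of rows, each a list of cell strings."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # Collect the Ids listed under CHILD relationships, if there are any.
        ids = []
        for rel in block.get("Relationships") or []:
            if rel["Type"] == "CHILD":
                ids.extend(rel["Ids"])
        return ids

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            # A cell's text is the concatenation of its WORD children.
            words = [by_id[i]["Text"] for i in child_ids(cell)
                     if by_id[i]["BlockType"] == "WORD"]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        # RowIndex and ColumnIndex are 1-based; missing cells become "".
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


def analyze_pdf_tables(client, pdf_bytes):
    """Call Textract synchronously on a single-page PDF and return its tables."""
    response = client.analyze_document(
        Document={"Bytes": pdf_bytes},
        FeatureTypes=["TABLES"],
    )
    return extract_tables(response["Blocks"])
```

To use it, create the client with boto3.client("textract"), read the PDF into bytes, and call analyze_pdf_tables(client, pdf_bytes). Multi-page PDFs would still need the asynchronous start_document_analysis route, or splitting into single pages first.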

Upvotes: 0

Soumya

Reputation: 431

Thanks for the answers; they helped me look into this further. I found that the detect_document_text method in Textract can extract text from a PDF document, provided the PDF has only one page. This is a synchronous call, and the PDF does not need to be converted to an image at all.

AWS reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/detect_document_text.html

Below is the code snippet, where I pass the binary content of an S3 object:

import logging

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

# 'my-bucket' is a placeholder; use the bucket that holds the PDF
bucket = boto3.resource('s3').Bucket('my-bucket')
obj = bucket.Object('Test.pdf')
binary_file = obj.get()['Body'].read()

textract = boto3.client(service_name="textract", region_name="us-east-1")

def get_textract_response(file_content):
    # Returns the Textract response, or "Uncertain" if the call fails
    try:
        response = textract.detect_document_text(Document={'Bytes': file_content})
        logger.info(f"Detected {len(response['Blocks'])} blocks.")
    except ClientError:
        logger.exception("Couldn't detect text.")
        response = "Uncertain"
    except Exception:
        logger.exception("Textract could not detect text")
        response = "Uncertain"
    return response

response = get_textract_response(binary_file)

Upvotes: 1

Paradigm

Reputation: 2026

You cannot directly process PDF documents synchronously with Textract currently. From the Textract documentation:

Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats. Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format.

A work-around would be to convert the PDF document into images in your code and then use the synchronous API operations with these images to process the documents.
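A rough sketch of the second half of that workaround: it assumes the PDF pages have already been rendered to PNG or JPEG bytes (for example with a third-party library such as pdf2image, which is an assumption and not shown here) and calls the synchronous detect_document_text once per page image. The function name detect_lines_per_page is hypothetical, and the Textract client is passed in as a parameter.

```python
def detect_lines_per_page(client, page_images):
    """Run synchronous text detection on each page image.

    page_images: iterable of PNG/JPEG bytes, one item per PDF page.
    Returns one list of LINE texts per page, in page order.
    """
    pages = []
    for image_bytes in page_images:
        response = client.detect_document_text(Document={"Bytes": image_bytes})
        lines = [block["Text"] for block in response["Blocks"]
                 if block["BlockType"] == "LINE"]
        pages.append(lines)
    return pages
```

With a real client from boto3.client("textract"), each call returns immediately with that page's blocks, so no SNS topic, SQS queue or polling loop is needed. The trade-off is one API call per page.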

Upvotes: 3
