How to OCR a multi-page PDF using Python + AWS Textract + Lambda

I used the code below (Python + AWS Textract + Lambda), and the OCR succeeds and returns a response for a PDF with one page.

But when I test with a PDF that has more than one page, it doesn't work.

I need advice on how to OCR a multi-page PDF.

    import boto3

    client = boto3.client("textract")

    # Synchronous Textract call that asks two queries about the document in S3
    response = client.analyze_document(
        Document={
            "S3Object": {
                "Bucket": "test-bucket",
                "Name": "testpdf.pdf",
            }
        },
        FeatureTypes=["QUERIES"],
        QueriesConfig={
            "Queries": [
                {
                    "Text": "What is opening balance?",
                    "Alias": "OPEN",
                },
                {
                    "Text": "What is closing balance?",
                    "Alias": "CLOSE",
                },
            ]
        },
    )

    data = response

Any advice would be appreciated.

Upvotes: 2

Views: 1266

Answers (1)

Belval

Reputation: 1506

Multipage PDFs are supported only by the asynchronous API, so you need to call client.start_document_analysis, which returns a job ID that you can use to fetch the results later with client.get_document_analysis. You can also pass an SNS topic (through the NotificationChannel parameter) that is notified once processing is over, which allows a workflow like:

Call Amazon Textract Lambda => SNS Topic triggered => Fetch Results lambda
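As a rough sketch of the first Lambda in that workflow, something like the following could start the asynchronous job. The bucket and file names are taken from the question; the helper names (build_start_request, start_analysis) and the environment variables SNS_TOPIC_ARN and TEXTRACT_ROLE_ARN are hypothetical and only there for illustration. As far as I know, a query without "Pages" is applied to the first page only, so the sketch sets "Pages" to ["*"]; check that against the Textract docs for your case.

```python
import os


def build_start_request(bucket, key, sns_topic_arn=None, role_arn=None):
    """Build the keyword arguments for start_document_analysis."""
    request = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [
                # "Pages": ["*"] asks Textract to run the query on every page
                {"Text": "What is opening balance?", "Alias": "OPEN", "Pages": ["*"]},
                {"Text": "What is closing balance?", "Alias": "CLOSE", "Pages": ["*"]},
            ]
        },
    }
    # The notification channel is optional; without it you have to poll.
    if sns_topic_arn and role_arn:
        request["NotificationChannel"] = {
            "SNSTopicArn": sns_topic_arn,
            "RoleArn": role_arn,
        }
    return request


def start_analysis(client, bucket, key, sns_topic_arn=None, role_arn=None):
    """Start the asynchronous job and return its job ID."""
    response = client.start_document_analysis(
        **build_start_request(bucket, key, sns_topic_arn, role_arn)
    )
    return response["JobId"]


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without boto3 installed.
    import boto3

    client = boto3.client("textract")
    job_id = start_analysis(
        client,
        "test-bucket",
        "testpdf.pdf",
        sns_topic_arn=os.environ.get("SNS_TOPIC_ARN"),    # hypothetical env var
        role_arn=os.environ.get("TEXTRACT_ROLE_ARN"),     # hypothetical env var
    )
    return {"JobId": job_id}
```

The role passed in RoleArn has to allow Textract to publish to the SNS topic; if either value is missing, the sketch simply starts the job without a notification channel.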

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_analysis
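For the "fetch results" side, here is a minimal sketch of a Lambda subscribed to the SNS topic. It assumes the Textract completion message is a JSON string containing "JobId" and "Status", and that results come back in pages linked by NextToken. The helper names (job_from_sns_event, fetch_all_blocks, query_answers) are made up for this example.

```python
import json


def job_from_sns_event(event):
    """Extract the job ID and status from the SNS message Textract publishes."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def fetch_all_blocks(client, job_id):
    """Collect the blocks from every result page by following NextToken."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            return blocks


def query_answers(blocks):
    """Map each query alias to the list of answer texts found for it."""
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        alias = block["Query"].get("Alias", block["Query"]["Text"])
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for result_id in relationship["Ids"]:
                answers.setdefault(alias, []).append(by_id[result_id].get("Text", ""))
    return answers


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without boto3 installed.
    import boto3

    job_id, status = job_from_sns_event(event)
    if status != "SUCCEEDED":
        return {"JobId": job_id, "Status": status, "Answers": {}}
    client = boto3.client("textract")
    blocks = fetch_all_blocks(client, job_id)
    return {"JobId": job_id, "Status": status, "Answers": query_answers(blocks)}
```

With "Pages" set to all pages, a query can have one answer per page, which is why the sketch keeps a list per alias instead of a single value.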

Upvotes: 2
