Reputation: 97
I used the code below (Python + AWS Textract + Lambda), and OCR succeeds: I get a response for a single-page PDF.
But when I test with a PDF that has more than one page, it doesn't work.
I need advice on how to OCR a multipage PDF.
import boto3

client = boto3.client("textract")

# Synchronous call: works for a single-page PDF stored in S3
response = client.analyze_document(
    Document={
        "S3Object": {
            "Bucket": "test-bucket",
            "Name": "testpdf.pdf",
        }
    },
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [
            {
                "Text": "What is opening balance?",
                "Alias": "OPEN",
            },
            {
                "Text": "What is closing balance?",
                "Alias": "CLOSE",
            },
        ]
    },
)
data = response
Upvotes: 2
Views: 1266
Reputation: 1506
Multipage PDFs are only supported by the asynchronous API, so you need to use
client.start_document_analysis
instead of analyze_document. It returns a job ID that you can pass to client.get_document_analysis to fetch the results later. You can also pass an SNS topic, which is notified once processing is over, allowing for a workflow like:
Call Amazon Textract Lambda => SNS topic triggered => Fetch results Lambda
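A rough sketch of that flow, in case it helps. The helper names (start_analysis, fetch_blocks, job_from_sns_event) and the polling defaults are made up for illustration; the boto3 calls are start_document_analysis and get_document_analysis. It assumes the PDF is already in S3 and that the role passed in NotificationChannel is allowed to publish to the topic. Treat it as a starting point, not a finished implementation.

```python
import json
import time


def start_analysis(client, bucket, key, queries, sns_topic_arn=None, role_arn=None):
    """Start an asynchronous Textract job and return its job id."""
    params = {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": queries},
    }
    # Optional: have Textract publish a completion message to SNS.
    if sns_topic_arn and role_arn:
        params["NotificationChannel"] = {
            "SNSTopicArn": sns_topic_arn,
            "RoleArn": role_arn,
        }
    return client.start_document_analysis(**params)["JobId"]


def fetch_blocks(client, job_id, poll_seconds=5, max_polls=60):
    """Wait for the job to finish, then collect blocks from all result pages."""
    for _ in range(max_polls):
        response = client.get_document_analysis(JobId=job_id)
        if response["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} is still in progress")

    # This sketch treats anything other than SUCCEEDED as an error,
    # including PARTIAL_SUCCESS.
    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {response['JobStatus']}")

    blocks = list(response.get("Blocks", []))
    # Results are paginated: follow NextToken until it is absent.
    while response.get("NextToken"):
        response = client.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks


def job_from_sns_event(event):
    """Extract (job id, status) from the SNS event in the 'fetch results' Lambda."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def run_example():
    """Not called automatically; needs boto3 and AWS credentials."""
    import boto3

    client = boto3.client("textract")
    queries = [
        {"Text": "What is opening balance?", "Alias": "OPEN"},
        {"Text": "What is closing balance?", "Alias": "CLOSE"},
    ]
    job_id = start_analysis(client, "test-bucket", "testpdf.pdf", queries)
    return fetch_blocks(client, job_id)
```

With the SNS route, the first Lambda would only call start_analysis, and the second one would call job_from_sns_event followed by fetch_blocks, so the polling loop returns on its first pass. Also, as far as I know each query accepts an optional "Pages" list (for example ["*"]) and is applied to the first page only when it is omitted, so it is worth checking that in the Query documentation if answers from later pages are missing.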
Upvotes: 2