Reputation: 4040
I want to extract tables from a set of PDFs. To do this, I am using the AWS Textract Python pipeline.
How can I do this without SNS and SQS? I want it to be synchronous: pass a PDF file to my pipeline, call AWS Textract, and get the results.
Here is what I am using in the meantime. What should I change?
import boto3
import time


def startJob(s3BucketName, objectName):
    # Start an asynchronous text-detection job for a document stored in S3.
    response = None
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]


def isJobComplete(jobId):
    # For production use cases, use SNS based notification
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    time.sleep(5)
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    # Poll every 5 seconds until the job leaves the IN_PROGRESS state.
    while status == "IN_PROGRESS":
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status


def getJobResults(jobId):
    # Collect every page of the result set by following NextToken.
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = response.get('NextToken')
    while nextToken:
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')
    return pages


# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "Amazon-Textract-Pdf.pdf"
jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
# isJobComplete returns the final status string, which is always truthy,
# so compare it explicitly instead of testing it directly.
if isJobComplete(jobId) == "SUCCEEDED":
    response = getJobResults(jobId)
    #print(response)
    # Print detected text
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                print('\033[94m' + item["Text"] + '\033[0m')
Upvotes: 2
Views: 6280
Reputation: 1
Just as an update to @Paradigm's answer, there is now support for PDFs in Textract.
The document image can be in either PNG, JPEG, PDF, or TIFF format. Results for synchronous operations are returned immediately and are not stored for retrieval.
source: https://docs.aws.amazon.com/textract/latest/dg/sync-calling.html
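Since the question is about tables, here is a minimal sketch of what a synchronous call could look like for a single-page PDF, using analyze_document with FeatureTypes=["TABLES"]. The helper names (tables_from_blocks, analyze_tables_sync) and the optional client argument are my own, not part of boto3, and I am assuming the response carries the usual TABLE, CELL and WORD blocks linked through CHILD relationships.

```python
def tables_from_blocks(blocks):
    """Rebuild each TABLE block as a list of rows (lists of cell strings)."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Ids of every block listed under a CHILD relationship.
        return [
            cid
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
        ]

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell in (by_id[i] for i in child_ids(table)):
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[i]["Text"]
                for i in child_ids(cell)
                if by_id[i]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables


def analyze_tables_sync(pdf_bytes, client=None):
    """Hypothetical helper: one blocking call, no SNS or SQS involved."""
    if client is None:
        import boto3  # imported lazily so a stub client can be passed in
        client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": pdf_bytes},
        FeatureTypes=["TABLES"],
    )
    return tables_from_blocks(response["Blocks"])
```

You would call it with the raw bytes of the file, for example analyze_tables_sync(open("page.pdf", "rb").read()), where page.pdf is a placeholder name.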
Upvotes: 0
Reputation: 431
Thanks for the answers; they helped me analyse this further. I found that the detect_document_text method in Textract can be used to extract text from a PDF document, on the condition that the PDF has only one page. This is a synchronous process, and there is no need to convert the PDF to an image at all.
Here is the AWS reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/detect_document_text.html
Below is the code snippet, where I pass the binary content of an S3 object:
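import logging

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

# 'bucket' is assumed to be a boto3 S3 Bucket resource created elsewhere.
obj = bucket.Object('Test.pdf')
binary_file = obj.get().get('Body').read()
textract = boto3.client(service_name="textract", region_name="us-east-1")


def get_textract_response(file_content):
    # Synchronous call; the PDF must contain only one page.
    response = None
    try:
        response = textract.detect_document_text(Document={'Bytes': file_content})
        logger.info(f"Detected {len(response['Blocks'])} blocks.")
    except ClientError:
        logger.exception("Couldn't detect text.")
        response = "Uncertain"
    except Exception:
        logger.info("textract could not detect text")
        response = "Uncertain"
    # Return on every path; an 'else: return' would yield None after an error.
    return response


response = get_textract_response(binary_file)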
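Because the synchronous call only covers single-page PDFs, multi-page documents still need the asynchronous API. It can be made to behave synchronously without SNS or SQS by polling, as in the question. Below is a rough sketch of that idea with a timeout added; the function name run_text_detection and its poll_seconds, timeout_seconds and sleep parameters are my own choices, not anything Textract defines.

```python
import time


def run_text_detection(bucket, key, client=None, poll_seconds=5,
                       timeout_seconds=600, sleep=time.sleep):
    """Start an async job, block until it finishes, and return all blocks."""
    if client is None:
        import boto3  # imported lazily so a stub client can be passed in
        client = boto3.client("textract")

    job_id = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )["JobId"]

    # Poll until the job leaves IN_PROGRESS or the time budget runs out.
    waited = 0
    while True:
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        if waited >= timeout_seconds:
            raise TimeoutError(f"Textract job {job_id} is still running")
        sleep(poll_seconds)
        waited += poll_seconds

    if status != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # Follow NextToken to collect the remaining pages of the result set.
    blocks = list(response.get("Blocks", []))
    while "NextToken" in response:
        response = client.get_document_text_detection(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks
```

The caller just sees a blocking function: blocks = run_text_detection("my-bucket", "my-document.pdf"), where both names are placeholders. Passing sleep as a parameter is only there so the loop can be exercised without real waiting.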
Upvotes: 1
Reputation: 2026
You cannot directly process PDF documents synchronously with Textract currently. From the Textract documentation:
Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats. Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format.
A work-around would be to convert the PDF document into images in your code and then use the synchronous API operations with these images to process the documents.
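A rough sketch of that work-around, assuming the third-party pdf2image package (which needs poppler installed) for the conversion; any converter that yields PIL-style images with a save method would do. The function name detect_text_per_page and its client and to_images parameters are hypothetical, added so the pieces can be swapped out.

```python
import io


def detect_text_per_page(pdf_bytes, client=None, to_images=None):
    """Render each PDF page to PNG and run the synchronous API on it."""
    if to_images is None:
        # Assumption: pdf2image is installed; it returns one PIL image per page.
        from pdf2image import convert_from_bytes
        to_images = convert_from_bytes
    if client is None:
        import boto3  # imported lazily so a stub client can be passed in
        client = boto3.client("textract")

    pages = []
    for image in to_images(pdf_bytes):
        # Serialise the page image to PNG bytes in memory.
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        response = client.detect_document_text(
            Document={"Bytes": buffer.getvalue()}
        )
        # Keep only the LINE blocks, in the order Textract returned them.
        pages.append(
            [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
        )
    return pages
```

Each page costs one API call, so this is slower for long documents than a single asynchronous job, but it needs no job polling and no notification setup.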
Upvotes: 3