Reputation: 618
I'm using boto3 (the AWS SDK for Python) to analyze a PDF document and extract its form key-value pairs.
import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()

    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(Document={'S3Object': {'Bucket': bucket, 'Name': document}},
                                       FeatureTypes=["FORMS"])

process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')
I followed the AWS documentation for AnalyzeDocument, but when I run my function I get this error:
botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format
Am I missing something?
Upvotes: 6
Views: 8838
Reputation: 6667
As the docs say:
StartDocumentAnalysis can analyze text in documents that are in JPEG, PNG, TIFF, and PDF format. The documents are stored in an Amazon S3 bucket. Use DocumentLocation to specify the bucket name and file name of the document.
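import boto3

client = boto3.client('textract')

# Start the asynchronous analysis job
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'YOUR_BUCKET_NAME',
            'Name': 'YOUR_FILE_KEY_NAME'
        }
    },
    FeatureTypes=["FORMS"]
)

# Get results from asynchronous operation.
# The job takes time to run: check result['JobStatus'] and call this again
# until it is 'SUCCEEDED' before reading result['Blocks'].
result = client.get_document_analysis(JobId=response['JobId'])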
Additionally, the AWS docs provide a TextractWrapper class with start_analysis_job and get_analysis_job methods that do the same as the previous example.
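A call to get_document_analysis made right after starting the job will usually report that the job is still running, so in practice you poll until it finishes and then follow NextToken to collect every page of results. Here is a minimal sketch of that loop, not an official helper: the function name wait_for_analysis and its delay, max_attempts, and sleep parameters are my own, and the client is passed in so the function works with any boto3 Textract client.

```python
import time


def wait_for_analysis(client, job_id, delay=5.0, max_attempts=60, sleep=time.sleep):
    """Poll GetDocumentAnalysis until the job finishes, then return all blocks.

    `client` is assumed to be a boto3 Textract client; the other parameter
    names are illustrative choices, not part of the AWS API.
    """
    for _ in range(max_attempts):
        page = client.get_document_analysis(JobId=job_id)
        status = page["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            break
        if status == "FAILED":
            raise RuntimeError(page.get("StatusMessage", "Textract job failed"))
        sleep(delay)  # still IN_PROGRESS: wait before asking again
    else:
        raise TimeoutError(f"Job {job_id} did not finish after {max_attempts} polls")

    blocks = list(page.get("Blocks", []))
    # Large results are paginated; follow NextToken until it is absent.
    while "NextToken" in page:
        page = client.get_document_analysis(JobId=job_id, NextToken=page["NextToken"])
        blocks.extend(page.get("Blocks", []))
    return blocks


# Usage (needs AWS credentials and a started job):
#   client = boto3.client('textract')
#   blocks = wait_for_analysis(client, response['JobId'])
```

For production use, Textract can also publish a completion message to an SNS topic (the NotificationChannel parameter of start_document_analysis), which avoids polling altogether.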
Upvotes: 3
Reputation: 3180
AnalyzeDocument is a synchronous API that only supports PNG or JPG images.
Since you want to work with PDF files, you'll need to use the Amazon Textract asynchronous APIs, e.g. StartDocumentAnalysis or StartDocumentTextDetection.
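Once the asynchronous job has returned its Blocks, the form key:value pairs the question asks about still have to be assembled from KEY_VALUE_SET blocks and their relationships. A rough sketch of that step, assuming the block layout described in the Textract docs (KEY blocks link to VALUE blocks via a VALUE relationship, and both link to WORD blocks via CHILD). The helper names form_key_values and _text_of are hypothetical, not part of boto3.

```python
def _text_of(block, by_id):
    """Join the text of a block's CHILD words (hypothetical helper)."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id.get(child_id)
            if child is None:
                continue  # child block not in this list; skip it
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                # Checkboxes report SELECTED / NOT_SELECTED instead of text
                words.append(child["SelectionStatus"])
    return " ".join(words)


def form_key_values(blocks):
    """Map each form key's text to its value's text."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_block = by_id.get(value_id)
                    if value_block is not None:
                        value_parts.append(_text_of(value_block, by_id))
        pairs[_text_of(block, by_id)] = " ".join(value_parts)
    return pairs
```

Note that repeated keys on a form overwrite each other in this dict; collect values into lists if your documents have duplicate labels.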
Upvotes: 12