gmwill934

Reputation: 618

AWS Textract - UnsupportedDocumentException - PDF

I'm using boto3 (the AWS SDK for Python) to analyze a PDF document and extract its form key:value pairs.

import boto3

def process_text_analysis(bucket, document):
    # Get the document from S3
    s3_connection = boto3.resource('s3')
    s3_object = s3_connection.Object(bucket, document)
    s3_response = s3_object.get()
    # Analyze the document
    client = boto3.client('textract')
    response = client.analyze_document(Document={'S3Object': {'Bucket': bucket, 'Name': document}},
                                       FeatureTypes=["FORMS"])


process_text_analysis('francismorgan-01', '709 Privado M SURESTE.pdf')

I followed the AWS documentation for AnalyzeDocument, but when I run my function I get this error:

botocore.errorfactory.UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the AnalyzeDocument operation: Request has unsupported document format

Am I missing something?

Upvotes: 6

Views: 8838

Answers (2)

Miguel Trejo

Reputation: 6667

As the docs say:

StartDocumentAnalysis can analyze text in documents that are in JPEG, PNG, TIFF, and PDF format. The documents are stored in an Amazon S3 bucket. Use DocumentLocation to specify the bucket name and file name of the document.

Boto3 Example

import boto3

client = boto3.client('textract')

response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'YOUR_BUCKET_NAME',
            'Name': 'YOUR_FILE_KEY_NAME'
        }
    },
    FeatureTypes=["FORMS"]
)

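# Get results from the asynchronous operation.
# Note: the job may still be IN_PROGRESS at this point; check result['JobStatus']
# and call get_document_analysis again until the job has finished.
result = client.get_document_analysis(JobId=response['JobId'])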

Additionally, the AWS docs provide a TextractWrapper class with start_analysis_job and get_analysis_job methods that do the same as the previous example.
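Because start_document_analysis only queues a job, a single get_document_analysis call right afterwards will often find it still running. Below is a minimal polling sketch, not a definitive implementation. The helper names wait_for_analysis and analyze_pdf and the delay and max_attempts parameters are my own; it assumes the job reports its state in JobStatus and that large results are paginated via NextToken. For production use, the SNS NotificationChannel option is probably a better fit than polling.

```python
import time


def wait_for_analysis(client, job_id, delay=5.0, max_attempts=60):
    """Poll GetDocumentAnalysis until the job finishes, then return all blocks."""
    for _ in range(max_attempts):
        result = client.get_document_analysis(JobId=job_id)
        status = result["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            break
        if status == "FAILED":
            raise RuntimeError(result.get("StatusMessage", "Textract job failed"))
        time.sleep(delay)  # still IN_PROGRESS, wait before asking again
    else:
        raise TimeoutError(f"Job {job_id} did not finish after {max_attempts} attempts")

    blocks = list(result.get("Blocks", []))
    # Large results come back in pages; follow NextToken until it is absent.
    while "NextToken" in result:
        result = client.get_document_analysis(JobId=job_id, NextToken=result["NextToken"])
        blocks.extend(result.get("Blocks", []))
    return blocks


def analyze_pdf(bucket, key, client=None, delay=5.0, max_attempts=60):
    """Start a FORMS analysis job for an S3 object and wait for its blocks."""
    if client is None:
        import boto3  # imported lazily so the helpers can be tested with a stub client

        client = boto3.client("textract")
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS"],
    )
    return wait_for_analysis(client, response["JobId"], delay, max_attempts)
```

Usage would look like blocks = analyze_pdf('YOUR_BUCKET_NAME', 'YOUR_FILE_KEY_NAME'), with the same placeholders as above.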

Upvotes: 3

aksyuma

Reputation: 3180

AnalyzeDocument is a synchronous API that only supports PNG or JPG images.

Since you want to work with PDF files, you'll need to use the Amazon Textract asynchronous API, e.g. StartDocumentAnalysis or StartDocumentTextDetection.
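Once an analysis job with the FORMS feature has finished, the key:value pairs the question asks about are spread across the returned Blocks list rather than handed back as a dictionary. Here is a rough sketch of stitching them together. The function name extract_form_fields is hypothetical; it assumes the usual block layout, where KEY_VALUE_SET blocks carry an EntityTypes of KEY or VALUE, keys point to their values through a VALUE relationship, and the text lives in CHILD blocks of type WORD or SELECTION_ELEMENT.

```python
def extract_form_fields(blocks):
    """Build a {key text: value text} dict from a list of Textract blocks."""
    by_id = {block["Id"]: block for block in blocks}

    def related_ids(block, rel_type):
        # Relationships may be missing entirely on blocks with no links.
        return [
            block_id
            for rel in (block.get("Relationships") or [])
            if rel["Type"] == rel_type
            for block_id in rel["Ids"]
        ]

    def text_of(block):
        parts = []
        for child_id in related_ids(block, "CHILD"):
            child = by_id.get(child_id, {})
            if child.get("BlockType") == "WORD":
                parts.append(child["Text"])
            elif child.get("BlockType") == "SELECTION_ELEMENT":
                # Checkboxes report SELECTED or NOT_SELECTED instead of text.
                parts.append(child["SelectionStatus"])
        return " ".join(parts)

    fields = {}
    for block in blocks:
        is_key = (
            block.get("BlockType") == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        values = [
            text_of(by_id[value_id])
            for value_id in related_ids(block, "VALUE")
            if value_id in by_id
        ]
        fields[text_of(block)] = " ".join(values)
    return fields
```

Note that repeated key text on a form would overwrite earlier entries in this sketch; collecting values into a list per key is one way around that.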

Upvotes: 12
