Textract Unsupported Document Exception

I'm trying to use boto3 to run a textract detect_document_text request.

I'm using the following code:

client = boto3.client('textract')
response = client.detect_document_text(
    Document={
        'Bytes': image_b64['document_b64']
    }
)

Here image_b64['document_b64'] is a Base64-encoded image string that I generated with, for example, the https://base64.guru/converter/encode/image website.

But I'm getting the following error:

UnsupportedDocumentException

What am I doing wrong?

Upvotes: 3

Answers (4)

jeremyforan

Reputation: 1437

This worked for me. It assumes you have configured ~/.aws with your AWS credentials.

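import boto3
import os

# Directory holding the local images (the original listed ./img but opened ./imgs)
IMAGE_DIR = './img'

def main():
    client = boto3.client('textract', region_name="ca-central-1")

    for image_name in os.listdir(IMAGE_DIR):
        # Build the path from the same directory that was listed
        image_path = os.path.join(IMAGE_DIR, image_name)

        with open(image_path, "rb") as f:
            # f.read() returns the raw binary bytes; no Base64 encoding is applied
            response = client.analyze_expense(
                Document={
                    'Bytes': f.read(),
                    'S3Object': {
                        'Bucket': 'REDACTED',
                        'Name': image_name,
                        'Version': '1'
                    }
                })

            print(response)

if __name__ == "__main__":
    main()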

Upvotes: 0

Aman

Reputation: 111

With Boto3 in a Jupyter notebook, for an image (.jpg or .png), you can use:

import boto3

# images_path is a placeholder for the path to your local .jpg or .png file
with open(images_path, "rb") as img_file:
    img_bytes = bytearray(img_file.read())  # raw binary bytes, not Base64

textract = boto3.client('textract')
response = textract.detect_document_text(Document={'Bytes': img_bytes})

Upvotes: 0

Gabriel Marcondes

Reputation: 301

For future reference, I solved the problem by decoding the Base64 string before passing it to Textract:

import base64
import boto3

client = boto3.client('textract')
# Decode the Base64 string back into raw binary bytes
image_bytes = base64.b64decode(image_b64['document_b64'])
response = client.detect_document_text(
    Document={
        'Bytes': image_bytes
    }
)

Upvotes: 2

dz902

Reputation: 5828

Per the documentation:

If you're using an AWS SDK to call Amazon Textract, you might not need to base64-encode image bytes passed using the Bytes field.

Base64 encoding is only required when invoking the REST API directly. When using the Python or Node.js SDK, pass the native (binary) bytes.
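
To make that concrete, here is a minimal sketch of normalizing the input before it goes into the Bytes field. The helper names to_raw_bytes and looks_like_png_or_jpeg are hypothetical (they are not part of boto3), and the data URI handling is an assumption about what an online converter might produce. The signature check only covers PNG and JPEG and is a rough sanity test, not a guarantee that Textract will accept the file.

```python
import base64
import binascii


def to_raw_bytes(document):
    """Return raw binary bytes for the SDK's Bytes field.

    Hypothetical helper: accepts raw bytes, or a Base64 string that may
    carry a data URI header such as "data:image/png;base64,".
    """
    if isinstance(document, (bytes, bytearray)):
        return bytes(document)
    # Drop a data URI header if one is present.
    if document.startswith("data:") and "," in document:
        document = document.split(",", 1)[1]
    # Remove line breaks or spaces that a converter may have inserted.
    document = "".join(document.split())
    try:
        return base64.b64decode(document, validate=True)
    except binascii.Error as exc:
        raise ValueError("input is not valid Base64") from exc


def looks_like_png_or_jpeg(data):
    """Rough check against the PNG and JPEG file signatures."""
    return data.startswith(b"\x89PNG\r\n\x1a\n") or data.startswith(b"\xff\xd8\xff")
```

With something like this, the call from the question would pass Document={'Bytes': to_raw_bytes(image_b64['document_b64'])}. If looks_like_png_or_jpeg returns False for the decoded data, the string was probably not an encoded PNG or JPEG in the first place.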

Upvotes: 1
