jkortner

Reputation: 631

Amazon Textract without using Amazon S3

I want to extract information from PDFs using Amazon Textract (as in How to use the Amazon Textract with PDF files). All the answers and the AWS documentation require the input to be an Amazon S3 object.

Can I use Textract without uploading the PDFs to Amazon S3, by sending them directly in the REST call? (I have to store the PDFs locally.)

Upvotes: 0

Views: 3005

Answers (2)

nikhilweee

Reputation: 4559

I would love to be proven wrong, but as of August 2023 this is not possible.

According to the documentation, all the async methods (StartDocumentAnalysis, StartDocumentTextDetection, StartExpenseAnalysis and StartLendingAnalysis) take a required argument called DocumentLocation.

Looking at the documentation again, DocumentLocation can only describe an S3 object:

   "DocumentLocation": { 
      "S3Object": { 
         "Bucket": "string",
         "Name": "string",
         "Version": "string"
      }
   }

Upvotes: 3

smac2020

Reputation: 10734

I will answer this question with the Java API in mind. The short answer is yes.

If you look at this TextractAsyncClient Javadoc for a given operation:

https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/textract/TextractAsyncClient.html#analyzeDocument-software.amazon.awssdk.services.textract.model.AnalyzeDocumentRequest-

It states:

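"Documents for asynchronous operations can also be in PDF format"

This means you can read a local PDF and build an AnalyzeDocumentRequest from its bytes, without pulling it from an Amazon S3 bucket. Note that analyzeDocument is one of Textract's synchronous operations, which accept the document as raw bytes but limit PDFs to a single page: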

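import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.textract.TextractClient;
import software.amazon.awssdk.services.textract.model.AnalyzeDocumentRequest;
import software.amazon.awssdk.services.textract.model.Document;
import software.amazon.awssdk.services.textract.model.FeatureType;
import software.amazon.awssdk.services.textract.model.TextractException;

public static void analyzeDoc(TextractClient textractClient, String sourceDoc) {

        // try-with-resources closes the local file stream when done
        try (InputStream sourceStream = new FileInputStream(new File(sourceDoc))) {
            SdkBytes sourceBytes = SdkBytes.fromInputStream(sourceStream);

            // Wrap the raw bytes in the input Document object (no S3Object needed)
            Document myDoc = Document.builder()
                    .bytes(sourceBytes)
                    .build();

            // Ask Textract to extract both form fields and tables
            List<FeatureType> featureTypes = new ArrayList<FeatureType>();
            featureTypes.add(FeatureType.FORMS);
            featureTypes.add(FeatureType.TABLES);

            AnalyzeDocumentRequest analyzeDocumentRequest = AnalyzeDocumentRequest.builder()
                    .featureTypes(featureTypes)
                    .document(myDoc)
                    .build();

            // Pass the request to the client, e.g. textractClient.analyzeDocument(analyzeDocumentRequest)
            // (TextractAsyncClient offers the same analyzeDocument operation)

            ...

        } catch (IOException | TextractException e) {
            System.err.println(e.getMessage());
        }
}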
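
Since the question asks about the plain REST call, here is a minimal sketch of what the request body could look like when the document is sent inline. It assumes the Document/Bytes request shape of the synchronous operations (such as DetectDocumentText or AnalyzeDocument), where the bytes are base64-encoded inside the JSON. The class and method names (TextractRequestBody, forDocumentBytes) are hypothetical and not part of any AWS SDK. A real call would also need AWS Signature Version 4 signing and the service's target and content-type headers, which are left out here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

// Hypothetical helper for illustration only; not part of the AWS SDK.
class TextractRequestBody {

    // Builds a JSON body of the form {"Document":{"Bytes":"<base64>"}}.
    // Assumption: this matches the request syntax of the synchronous operations.
    static String forDocumentBytes(byte[] documentBytes) {
        String encoded = Base64.getEncoder().encodeToString(documentBytes);
        // Standard base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=',
        // so it can be placed in a JSON string without escaping.
        return "{\"Document\":{\"Bytes\":\"" + encoded + "\"}}";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TextractRequestBody <local-file>");
            return;
        }
        // Read the locally stored document; nothing is uploaded to S3.
        byte[] document = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(forDocumentBytes(document));
    }
}
```

In practice the SDK builds and signs this request for you, so the hand-built body is mainly useful for seeing what "bytes instead of S3Object" means on the wire.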

Upvotes: 1
