Reputation: 631
I want to extract information from PDFs using Amazon Textract (as in "How to use the Amazon Textract with PDF files"). All the answers and the AWS documentation require the input to be an Amazon S3 object.
Can I use Textract without uploading the PDFs to Amazon S3, by passing them directly in the REST call? (I have to store the PDFs locally.)
Upvotes: 0
Views: 3005
Reputation: 4559
I would love to be proven wrong, but as of August 2023 this is not possible.
According to the documentation, all the async methods (StartDocumentAnalysis, StartDocumentTextDetection, StartExpenseAnalysis and StartLendingAnalysis) have a required argument called DocumentLocation.
Looking at the documentation again, DocumentLocation needs information about an S3 object:
    "DocumentLocation": {
        "S3Object": {
            "Bucket": "string",
            "Name": "string",
            "Version": "string"
        }
    }
Upvotes: 3
Reputation: 10734
I will answer this question with the Java API in mind. The short answer is yes.
The TextractAsyncClient Javadoc states, for a given operation:
"Documents for asynchronous operations can also be in PDF format"
This means you can read a PDF from local disk and build an AnalyzeDocumentRequest from its bytes, without pulling it from an Amazon S3 bucket:
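    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import software.amazon.awssdk.core.SdkBytes;
    import software.amazon.awssdk.services.textract.TextractClient;
    import software.amazon.awssdk.services.textract.model.AnalyzeDocumentRequest;
    import software.amazon.awssdk.services.textract.model.Document;
    import software.amazon.awssdk.services.textract.model.FeatureType;

    public static void analyzeDoc(TextractClient textractClient, String sourceDoc) {
        // try-with-resources closes the stream once the bytes have been read
        try (InputStream sourceStream = new FileInputStream(sourceDoc)) {
            SdkBytes sourceBytes = SdkBytes.fromInputStream(sourceStream);
            // Get the input Document object as bytes
            Document myDoc = Document.builder()
                    .bytes(sourceBytes)
                    .build();
            List<FeatureType> featureTypes = new ArrayList<>();
            featureTypes.add(FeatureType.FORMS);
            featureTypes.add(FeatureType.TABLES);
            AnalyzeDocumentRequest analyzeDocumentRequest = AnalyzeDocumentRequest.builder()
                    .featureTypes(featureTypes)
                    .document(myDoc)
                    .build();
            // Use the Textract client (or TextractAsyncClient) to perform an operation like analyzeDocument
            ...
        } catch (IOException e) {
            // The local file could not be opened or read
            System.err.println(e.getMessage());
        }
    }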
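Since the question asks about the raw REST call: passing a document inline means putting its base64-encoded bytes in the Document.Bytes field of the AnalyzeDocument request body, instead of a Document.S3Object. The sketch below only builds that JSON body using the standard library. The class and method names (TextractInlineBody, buildAnalyzeDocumentBody) are made up for illustration. It assumes the feature type names are plain identifiers such as TABLES and FORMS, so no JSON escaping is done. Actually sending it would still require a SigV4-signed POST with the headers X-Amz-Target: Textract.AnalyzeDocument and Content-Type: application/x-amz-json-1.1, which the SDK normally handles for you. Also, to my knowledge the synchronous operations only accept single-page PDFs, so check the current quotas before relying on this.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the AWS SDK.
class TextractInlineBody {

    // Builds the JSON body for an AnalyzeDocument call that carries the
    // document inline as base64 instead of pointing at an S3 object.
    // Assumes feature type names need no JSON escaping (e.g. TABLES, FORMS).
    static String buildAnalyzeDocumentBody(byte[] documentBytes, List<String> featureTypes) {
        String encoded = Base64.getEncoder().encodeToString(documentBytes);
        String features = featureTypes.stream()
                .map(f -> "\"" + f + "\"")
                .collect(Collectors.joining(","));
        return "{\"Document\":{\"Bytes\":\"" + encoded + "\"},"
                + "\"FeatureTypes\":[" + features + "]}";
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: TextractInlineBody <path-to-local-pdf>");
            return;
        }
        // Read the PDF from local disk; nothing is uploaded to S3.
        byte[] pdf = Files.readAllBytes(Path.of(args[0]));
        String body = buildAnalyzeDocumentBody(pdf, List.of("TABLES", "FORMS"));
        System.out.println("Request body of " + body.length() + " characters, ready to sign and POST");
    }
}
```

Base64 inflates the payload by roughly a third, so the size limit on inline documents is reached sooner than the file size on disk suggests.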
Upvotes: 1