Reputation: 1
We need to extract details from documents such as invoices and delivery challans. I was trying the AWS Textract demo, where you can upload a PDF document and see which details it extracts as key-value pairs, tables, etc.
While doing this, I found that a few specific keys that are very important for us, such as Invoice Number and PAN, are sometimes extracted and sometimes not, even though the documents I am using are of quite high quality.
So my question is: is there any way to specify exactly which keys we need extracted from the document?
If they are present in the document, AWS should extract them; otherwise it should leave those fields empty in the response.
Thanks, Kavita
Upvotes: 0
Views: 936
Reputation: 701
Amazon Textract offers a specialized API for invoices and receipts that you might want to use instead of the generic AnalyzeDocument API with the FORMS feature. This API is called AnalyzeExpense and offers better results for the invoice domain.
The advantage of this API is that it supports about 40 generic fields like TOTAL, ACCOUNT_NUMBER, etc. that are normalized for you even if they don't appear with these exact words in the document. For example, "Acc. Id", "Account #", and "Acc. Nbr" would all be found under the same generic "ACCOUNT_NUMBER" key.
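To make the normalization concrete, here is a minimal sketch that walks an AnalyzeExpense response and lists the label as printed on the document next to the normalized type. The `label_to_type` helper and the sample values are hypothetical and for illustration only; the dictionary layout (`ExpenseDocuments`, `SummaryFields`, `Type`, `LabelDetection`, `ValueDetection`) follows the shape of the AnalyzeExpense response as I understand it, trimmed down to the parts used here.

```python
def label_to_type(response):
    """Return (printed label, normalized type, value) for each summary field."""
    rows = []
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            # LabelDetection can be missing when the value has no printed label
            label = field.get("LabelDetection", {}).get("Text", "")
            field_type = field.get("Type", {}).get("Text", "")
            value = field.get("ValueDetection", {}).get("Text", "")
            rows.append((label, field_type, value))
    return rows


# Hand-written sample in the shape of an AnalyzeExpense response (not real output)
sample_response = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {
                    "Type": {"Text": "ACCOUNT_NUMBER"},
                    "LabelDetection": {"Text": "Acc. Nbr"},
                    "ValueDetection": {"Text": "12345"},
                },
                {
                    "Type": {"Text": "TOTAL"},
                    "LabelDetection": {"Text": "Amount due"},
                    "ValueDetection": {"Text": "99.50"},
                },
            ]
        }
    ]
}

rows = label_to_type(sample_response)
for row in rows:
    print(row)
```

Whatever the label on the page says, the `Type` text is the stable key to look up.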
You can use the amazon-textract-textractor package in order to simplify calling and parsing the Textract output. Here is a tutorial on how to use the AnalyzeExpense API.
In short you can call the API like this:
from textractor import Textractor

extractor = Textractor(profile_name="default")
document = extractor.analyze_expense(
    file_source="<YOUR_INVOICE_IMAGE>.png",
    save_image=True,
)
and access the fields like this:
print(document.summary_fields)
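As for keeping missing fields empty: as far as I know AnalyzeExpense does not take a list of required keys, so one option is to fill in the blanks yourself after the call. Below is a minimal sketch that works on the raw response dictionary (for example from boto3's `analyze_expense`) rather than on the textractor objects. The `pick_fields` helper and the sample data are hypothetical, and the assumption that your invoice number and PAN show up as `INVOICE_RECEIPT_ID` and `TAX_PAYER_ID` should be checked against your own documents.

```python
def pick_fields(response, wanted):
    """Map each wanted normalized type to its value, or "" if not found."""
    result = {key: "" for key in wanted}
    for doc in response.get("ExpenseDocuments", []):
        for field in doc.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text", "")
            value = field.get("ValueDetection", {}).get("Text", "")
            # Keep the first non-empty value seen for each wanted type
            if field_type in result and not result[field_type]:
                result[field_type] = value
    return result


# Assumed mapping for this use case; verify against your own invoices
WANTED = ["INVOICE_RECEIPT_ID", "TAX_PAYER_ID", "TOTAL"]

# Hand-written sample in the shape of an AnalyzeExpense response (not real output)
sample_response = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {
                    "Type": {"Text": "INVOICE_RECEIPT_ID"},
                    "ValueDetection": {"Text": "INV-001"},
                },
                {
                    "Type": {"Text": "TOTAL"},
                    "ValueDetection": {"Text": "99.50"},
                },
            ]
        }
    ]
}

fields = pick_fields(sample_response, WANTED)
print(fields)
```

Every key in `WANTED` is always present in the result, so downstream code can rely on a fixed set of fields and treat an empty string as "not found in the document".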
Upvotes: 0