Akshay Godase

Reputation: 187

Extract PDF table data using Azure Form Recognizer

I am working on an invoice processing project using Azure Form Recognizer. All the invoices are in PDF format. I am using a custom Form Recognizer model trained with labels. I can extract fields such as Invoice No, Invoice Date, and Amount, but I also want to extract table data from the PDF, and Form Recognizer is not reading the table correctly.

I have labeled the cells I need. When the number of rows in the table increases, it reads the column correctly, but it cannot separate the values of each row and returns the whole column as a single value.

I tried to provide more examples, but it is still failing to detect the correct table. Is there any way to extract table data properly from PDF using Azure Form Recognizer?

Scanning the table is an essential requirement for our application, and it will decide whether we build our application on Azure Form Recognizer.

Please see the PDF table image below. I want to extract the data from every row in every column. [image]

If you can point us in the right direction with some documentation on this, then it would be beneficial.

Thanks

Upvotes: 0

Views: 3521

Answers (2)

Anatolip

Reputation: 1

Form Recognizer has released an invoice-specific model that works across different invoice layouts. Please take a look at the documentation below:

https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-invoice

It can extract header fields as well as line items and their details.

You can try this model using Form Recognizer Studio (requires an Azure subscription and a Form Recognizer resource): https://formrecognizer.appliedai.azure.com/studio/prebuilt?formType=invoice
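For calling the same prebuilt model from code, here is a minimal sketch, assuming the `azure-ai-formrecognizer` Python package (version 3.2 or later) is installed. The function names `items_to_rows` and `analyze_invoice`, the chosen column list, and the `endpoint`, `key`, and `pdf_path` parameters are placeholders for illustration, not part of the service. Line items come back under the `Items` field, one entry per table row.

```python
def items_to_rows(items_field,
                  columns=("Description", "Quantity", "UnitPrice", "Amount")):
    """Flatten the invoice "Items" field into one dict per line item."""
    rows = []
    for item in (items_field.value if items_field else []):
        cells = item.value  # dict: sub-field name -> field object
        # Use the raw text (.content); a missing sub-field becomes None.
        rows.append({c: cells[c].content if c in cells else None
                     for c in columns})
    return rows


def analyze_invoice(endpoint, key, pdf_path):
    """Run the prebuilt invoice model on one PDF and return its line items."""
    # Imported here so items_to_rows can be used without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(pdf_path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-invoice", document=f)
        result = poller.result()
    # One list of rows per invoice detected in the file.
    return [items_to_rows(doc.fields.get("Items")) for doc in result.documents]
```

Each returned row is a plain dict keyed by column name, so the table rows stay separate instead of being merged into a single column value.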

Upvotes: 0

Neta

Reputation: 720

Please try the following:

  1. Train without labels and see if it detects and extracts the table you need. See quickstart here - https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/python-train-extract?tabs=v2-0

  2. If the table is not detected when training without labels, and it is also not detected automatically when training with labels, note that we do not yet support labeling tables natively. As a workaround, you can label the table as key-value pairs to extract the values. Label each cell as a separate value, so for the table above you would have 5 values per column: Desc1, Desc2, Desc3, ... Desc5, Hours1, Hours2, Hours3, ... Hours5. In this case you will need to train on tables with the maximum number of rows.
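With the key-value workaround, the model returns a flat set of labels, so the rows have to be rebuilt afterwards. Below is a minimal sketch of that regrouping. It assumes you have already pulled the recognized text out of the response into a plain dict of label to text, and that labels follow the Desc1/Hours1 naming above; `fields_to_rows` is a hypothetical helper, not a Form Recognizer API.

```python
import re
from collections import defaultdict


def fields_to_rows(fields):
    """Group flat labels such as Desc1, Hours1, Desc2, ... into table rows."""
    # Column name followed by a trailing row number, e.g. "Hours3".
    pattern = re.compile(r"^([A-Za-z_]+?)(\d+)$")
    rows = defaultdict(dict)
    for label, value in fields.items():
        match = pattern.match(label)
        if match is None:
            continue  # header field without a row number, e.g. InvoiceNo
        if value in (None, ""):
            continue  # cell not present on this invoice (fewer rows)
        column, index = match.group(1), int(match.group(2))
        rows[index][column] = value
    # Return the rows ordered by their row number.
    return [rows[i] for i in sorted(rows)]


# Example: an invoice that fills only 2 of the 5 labeled rows.
extracted = {
    "InvoiceNo": "A-17",
    "Desc1": "Design", "Hours1": "4",
    "Desc2": "Review", "Hours2": "2",
    "Desc3": None, "Hours3": None,
}
table = fields_to_rows(extracted)
```

Empty cells are skipped, so invoices with fewer rows than the labeled maximum produce a shorter table rather than rows of blanks.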

Neta - MSFT

Upvotes: 0
