Jayant Pande

Reputation: 69

Table data extraction from image or scanned documents (Not pdf)

I want to extract table data from images or scanned documents and map the header fields to their corresponding values, mostly in insurance documents. I have tried extracting the text line by line and then mapping the values using their position on the page. I defined the table boundary with a start pivot and an end pivot, but this doesn't give proper results because headers sometimes span multiple lines (I implemented this in PHP). I would also like to know whether machine learning can be used to achieve the same thing.

For PDF documents I have used tabula-java, which worked pretty well for me. Is there a similar implementation for images?

Insurance_Image

The documents would be similar to the one linked above, but from different service providers, so a generic method of extracting such data would be very useful.

In the image above I want to map values like Make = YAMAHA, MODEL = FZ-S, CC = 153, etc.

Thanks.

Upvotes: 2

Views: 2604

Answers (2)

Elia

Reputation: 822

The easiest and most reliable way to do this without much OCR knowledge would be:
- Take an empty template as a reference and mark the coordinates of the boxes you need to extract data from. Label them and save them for future use. This is done only once per template.
- When reading a filled-in copy of the same template, resize it to match the reference template's dimensions (if it doesn't already match).
- You already have every box's coordinates and know what data each should contain (because you labeled and saved them in the first step), so you can analyze just the pixels inside each box to find out what is written there.
If the data is typed rather than handwritten, the contents of each box are easier to read with a simple OCR library.
If the data is always in the same size and font, as in your example template above, you could even build your own small database of letters in that font and size, or perhaps of full words, depending on the possible values for each box.
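The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the reference size, the box coordinates and the helper names (`scale_boxes`, `crop`) are all made up, and the image is modelled as a plain list of pixel rows so the sketch needs no imaging library. Instead of resizing the scan, it scales the saved boxes to the scan's size, which picks out the same regions. Each cropped region would then be handed to an OCR library.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

# Hypothetical values: size of the empty reference template and the
# boxes that were marked and labeled once by hand on that template.
REFERENCE_SIZE = (1000, 1400)  # (width, height)
TEMPLATE_BOXES: Dict[str, Box] = {
    "Make": (100, 200, 300, 240),
    "Model": (320, 200, 520, 240),
    "CC": (540, 200, 640, 240),
}


def scale_boxes(boxes: Dict[str, Box],
                reference_size: Tuple[int, int],
                scan_size: Tuple[int, int]) -> Dict[str, Box]:
    """Map boxes from reference-template coordinates to scan coordinates."""
    ref_w, ref_h = reference_size
    scan_w, scan_h = scan_size
    sx, sy = scan_w / ref_w, scan_h / ref_h
    return {
        label: (round(l * sx), round(t * sy), round(r * sx), round(b * sy))
        for label, (l, t, r, b) in boxes.items()
    }


def crop(pixels: List[list], box: Box) -> List[list]:
    """Return the sub-grid of pixels inside the box (right/bottom exclusive)."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]


# Usage: a scan at twice the reference resolution.
scan_boxes = scale_boxes(TEMPLATE_BOXES, REFERENCE_SIZE, (2000, 2800))
for label, box in scan_boxes.items():
    print(label, box)  # each region would be cropped and passed to OCR
```

Scaling alone only works if the scan is not rotated or shifted relative to the template; a skewed scan would need to be aligned first.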

Anyway, this is by no means the best approach, but it would definitely get the work done with minimal effort and OCR knowledge.

Upvotes: 2

Lukasz Tracewski

Reputation: 11377

I would definitely give Tesseract a go; it is a very good OCR engine. I have used it successfully to read all sorts of documents embedded in emails (PDFs, images), and a colleague of mine used it for something very similar to your use case: reading specific fields from invoices.

After you parse the document, simply use regex to pick the fields of interest.
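As a rough sketch of the regex step, assuming the OCR output happens to contain each label followed by `:` or `=` and then its value on the same line. The sample text, the patterns and the `extract_fields` helper are illustrative only; real OCR output from a table may put headers and values on separate lines and would need different patterns.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns, one per field of interest. Each captures the
# value that follows the label and a ':' or '=' separator.
FIELD_PATTERNS = {
    "Make": re.compile(r"\bMake\s*[:=]\s*(\w+)", re.IGNORECASE),
    "Model": re.compile(r"\bModel\s*[:=]\s*([\w-]+)", re.IGNORECASE),
    "CC": re.compile(r"\bCC\s*[:=]\s*(\d+)", re.IGNORECASE),
}


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None if it is not found."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Stand-in for text returned by an OCR engine (not real OCR output).
ocr_text = "Make = YAMAHA, MODEL= FZ-S, CC= 153"
print(extract_fields(ocr_text))
```

Returning `None` for a missing field, rather than raising, makes it easier to spot which fields the OCR pass failed to read on a given document.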

I don't think machine learning would be particularly useful for you, unless you plan to build your own OCR engine. I'd start with existing libraries; they offer very good performance.

Upvotes: 1
