Table data extraction from image or scanned documents (Not pdf)

Question

I want to extract table data from images or scanned documents and map the header fields to their values, mostly in insurance documents. I tried extracting the text line by line and then mapping the values using their position on the page. I marked the table boundary by defining a start and an end pivot, but this doesn't give proper results, since headers sometimes span multiple lines (I implemented this in PHP). I would also like to know whether I can use machine learning to achieve the same.

For PDF documents I have used tabula-java, which worked pretty well for me. Is there a similar implementation for images?

Insurance_Image

The documents would be of a similar type to the one in the link above, but from different service providers, so a generic method of extracting such data would be very useful.

In the image above I want to map values like Make = YAMAHA, Model = FZ-S, CC = 153, etc.

Thanks.

Lukasz Tracewski · Accepted Answer

I would definitely give Tesseract a go; it is a very good OCR engine. I have been using it successfully to read all sorts of documents embedded in emails (PDFs, images), and a colleague of mine used it for something very similar to your use case: reading specific fields from invoices.

After you parse the document, simply use regex to pick the fields of interest.
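
As a rough illustration of that regex step, here is a minimal Python sketch. It assumes the OCR text is already available as a plain string (for instance from `pytesseract.image_to_string`, if you use that wrapper). The sample text, the field labels, and the `extract_fields` helper are all made up for illustration, based on the values mentioned in the question; real documents will need patterns tuned to each provider's layout.

```python
import re

# Hypothetical OCR output; in practice this string would come from the OCR engine.
SAMPLE_TEXT = """\
Make : YAMAHA    Model : FZ-S
CC : 153
"""

# One pattern per field of interest. The labels and value formats are
# assumptions for this example, not something taken from a real policy.
FIELD_PATTERNS = {
    "make": re.compile(r"\bMake\s*[:=]?\s*([A-Z][A-Z0-9-]*)", re.IGNORECASE),
    "model": re.compile(r"\bModel\s*[:=]?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "cc": re.compile(r"\bCC\s*[:=]?\s*(\d+)", re.IGNORECASE),
}


def extract_fields(text):
    """Return a dict of field name -> first matched value (None if not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    print(extract_fields(SAMPLE_TEXT))
```

Keeping the patterns in a dictionary makes it easy to add a field or maintain a separate set of patterns per service provider.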

I don't think machine learning would be particularly useful for you, unless you plan to build your own OCR engine. I'd start with existing libraries; they offer very good performance.
