Reputation: 763

How to read PDF templates using java OCR

Can someone suggest a solution for the scenario below?

We have menus from restaurants. Each restaurant has its own menu. The goal is to identify the elements in the menu, such as menu items, toppings, and prices, and update the database with them.

For example: a restaurant menu can contain menu items such as "Chicken" and "Vegetarian" under a group called "Sandwiches".

For that I am planning to use a Java implementation of OCR. Will this work?

Upvotes: 1

Answers (4)

nexus

Reputation: 172

Convert the PDF to an image (using JavaCV, for example) and OCR it using Tesseract or Tess4J. It is not a permanent solution, or the best one, but it works great!

Upvotes: 1

Thorn

Reputation: 4057

Interesting project! Whether in Java or any other language, I would think that OCR is not accurate enough for what you need. Menus are often printed with non-standard fonts and sometimes with background images, making it difficult for OCR to accurately read every word. Then you have the challenge of formatting. Some menus may organize the content by Chicken, Vegetarian, Beef. Others may have categories like Light Fare, Entree, Appetizer, Small Plates.

This strikes me as a real data engineering challenge. While menus seem like they are hierarchical, their actual structure is very flexible and varies a great deal from one to another. Adding OCR adds typos to this whole mess, and now you can't just look for words like "chicken", because you may actually have Chicen or Cichen or (h1ckn.
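One common way to cope with typos like these is edit-distance matching against a list of words you expect to see. The following is only a minimal sketch using plain Java; the class name `FuzzyMenuMatch`, the `closest` helper, the vocabulary, and the distance threshold are all assumptions made for illustration, not part of any OCR library.

```java
import java.util.List;

class FuzzyMenuMatch {
    // Levenshtein distance: minimum number of single-character
    // insertions, deletions, or substitutions to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the vocabulary word closest to the OCR output, or null
    // if nothing is within maxDistance edits (threshold is an assumption).
    static String closest(String ocrWord, List<String> vocabulary, int maxDistance) {
        String best = null;
        int bestDistance = maxDistance + 1;
        String word = ocrWord.toLowerCase();
        for (String candidate : vocabulary) {
            int d = levenshtein(word, candidate.toLowerCase());
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

With a vocabulary of "chicken", "vegetarian", and "beef" and a threshold of 2, `closest("Chicen", ...)` would pick "chicken". A badly mangled token such as (h1ckn is three edits away, so it would need a looser threshold, which in turn raises the risk of false matches.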

Maybe I've never used really great OCR software and I'm imagining a problem that isn't there. I would think that most restaurants type their menus on computers and you are better off trying to get them to share those files with you.

Upvotes: 0

lunar4dev

Reputation: 174

If you want to use OCR inside your code, you can go with Tesseract OCR plus some native development. It is a very powerful library that produces output quickly. This link is for a wrapper class for Tesseract, or you can use Tess4J as an alternative to the first one. This is the same library used by Google, and you can also add support for multiple languages.

Upvotes: 1

Alex Coleman

Reputation: 7326

If the PDF was typed up (so it contains actual text), there's no need for OCR; simply read the PDF (see below). However, if the PDF was scanned in (an image, not text), you will need to resort to OCR.

To read the PDF from a file, you could use something like iText or PDFBox.
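Once you have the text as a string, from either route, you still have to split it into groups and items. Here is a rough sketch in plain Java that assumes a very simple layout: an item line ends in a price like 6.50 or $6.50, and any other non-empty line is a group heading. The class `MenuLineParser`, its `MenuItem` type, and the layout rule are assumptions for illustration; real menus will need more rules than this.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MenuLineParser {
    // One parsed entry: the group heading it appeared under, its name, and its price.
    static final class MenuItem {
        final String group;
        final String name;
        final BigDecimal price;

        MenuItem(String group, String name, BigDecimal price) {
            this.group = group;
            this.name = name;
            this.price = price;
        }
    }

    // Assumed layout: name, optional spaces or dot leaders, optional "$", price with two decimals.
    private static final Pattern ITEM_LINE =
            Pattern.compile("^(.+?)[\\s.]*\\$?(\\d+\\.\\d{2})\\s*$");

    static List<MenuItem> parse(String text) {
        List<MenuItem> items = new ArrayList<>();
        String group = ""; // items seen before any heading get an empty group
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            Matcher m = ITEM_LINE.matcher(line);
            if (m.matches()) {
                items.add(new MenuItem(group, m.group(1).trim(), new BigDecimal(m.group(2))));
            } else {
                group = line; // no trailing price: treat the line as a group heading
            }
        }
        return items;
    }
}
```

For text like "Sandwiches" followed by "Chicken 6.50" and "Vegetarian ..... $5.25", this yields two items under the "Sandwiches" group. Each `MenuItem` could then be mapped to a database row.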

Upvotes: 0
