Coordinate extraction using Tesseract 4.0

Question

I am developing an application that will be used to automate invoice indexing. One use-case of my application is to extract tables from scanned documents. To do this, I need to extract the coordinates of all the words in the text (if this is not possible, I could use the coordinates of the letters as well). I plan on using Tesseract 4.0 for C# and I wanted to know if this is possible.

Thank you

Nish26 · Accepted Answer

Yes, you can get the bounding box of each recognized word. Below is sample code using the C# Tesseract wrapper.

  // Initialize the TesseractEngine
  using (var engine = new TesseractEngine("path to tessdata folder", "eng", EngineMode.Default))
  {
      // image is the Bitmap on which OCR is to be performed
      using (var page = engine.Process(image, PageSegMode.Auto))
      {
          using (var iterator = page.GetIterator())
          {
              iterator.Begin();
              do
              {
                  string currentWord = iterator.GetText(PageIteratorLevel.Word);
                  // TryGetBoundingBox returns false when no box is available for the current word
                  if (iterator.TryGetBoundingBox(PageIteratorLevel.Word, out Rect bounds))
                  {
                      // do something with currentWord and bounds
                  }
              }
              while (iterator.Next(PageIteratorLevel.Word));
          }
      }
  }

You can now store the bounds for each word and write your own logic to map the words to table rows and columns based on their bounding boxes. This is the difficult part, but if your table format is neat, you should be able to get it working with some effort. Also consider looking at the Tabula library to see if it can solve the problem at hand.
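
As a rough starting point for the row mapping, here is a minimal sketch that groups words into rows by comparing the vertical centers of their boxes, then orders each row left to right. The WordBox class, the RowGrouper helper and the tolerance parameter are hypothetical names made up for this illustration and are not part of the Tesseract wrapper; the sketch assumes you copy each word's text and corner coordinates out of the loop above into a WordBox. It only handles clean, unskewed tables, and column detection is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one OCR word and the corners of its bounding box.
public sealed class WordBox
{
    public string Text { get; }
    public int X1 { get; }
    public int Y1 { get; }
    public int X2 { get; }
    public int Y2 { get; }

    public WordBox(string text, int x1, int y1, int x2, int y2)
    {
        Text = text; X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }

    public double CenterY => (Y1 + Y2) / 2.0;
    public int Height => Y2 - Y1;
}

public static class RowGrouper
{
    // Groups words into rows. A word joins the current row when its vertical
    // center is within tolerance * (height of the row's first word) of that
    // first word's vertical center; otherwise it starts a new row.
    public static List<List<WordBox>> GroupIntoRows(IEnumerable<WordBox> words, double tolerance = 0.5)
    {
        var rows = new List<List<WordBox>>();
        foreach (var word in words.OrderBy(w => w.CenterY))
        {
            var current = rows.LastOrDefault();
            if (current != null &&
                Math.Abs(word.CenterY - current[0].CenterY) <= tolerance * current[0].Height)
            {
                current.Add(word);
            }
            else
            {
                rows.Add(new List<WordBox> { word });
            }
        }

        // Order each row left to right so the words read in column order.
        foreach (var row in rows)
        {
            row.Sort((a, b) => a.X1.CompareTo(b.X1));
        }
        return rows;
    }
}
```

Calling RowGrouper.GroupIntoRows(words) on the collected boxes returns the rows from top to bottom. The default tolerance of 0.5 is an arbitrary guess, so expect to tune it for your scans. Columns could then be inferred by looking for X ranges that stay empty across most rows.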
