Alexandra

Reputation: 85

Coordinate extraction using Tesseract 4.0

I am developing an application that will be used to automate invoice indexing. One use-case of my application is to extract tables from scanned documents. To do this, I need to extract the coordinates of all the words in the text (if this is not possible, I could use the coordinates of the letters as well). I plan on using Tesseract 4.0 for C# and I wanted to know if this is possible.

Thank you

Upvotes: 1

Views: 4158

Answers (2)

user3325078

Reputation: 1

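    ' Assumes page is the Tesseract Page being processed, and that listOfSearch,
    ' pdfPage and HighlightTextOnImage are defined elsewhere by the caller.
    Dim Region As Rect

    Using iterator = page.GetIterator()
        iterator.Begin()
        Do
            ' Text of the word at the iterator's current position
            Dim searchStr = iterator.GetText(PageIteratorLevel.Word)
            For Each k As String In listOfSearch
                ' Case-insensitive match against each search term
                If searchStr.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0 Then
                    ' Region receives the word's bounding box in image pixels
                    If iterator.TryGetBoundingBox(PageIteratorLevel.Word, Region) Then
                        Dim systemDrawingRect As New System.Drawing.Rectangle(Region.X1, Region.Y1, Region.Width, Region.Height)
                        pdfPage = HighlightTextOnImage(pdfPage, systemDrawingRect)
                    End If
                End If
            Next
        Loop While iterator.[Next](PageIteratorLevel.Word)
    End Using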

Upvotes: 0

Nish26

Reputation: 987

Yes, you can get the bounding box of each recognized word. Below is sample code using the C# Tesseract wrapper.

    // Initialize the TesseractEngine
    using (var engine = new TesseractEngine("path to tessdata folder", "eng", EngineMode.Default))
    {
        // image here is the Bitmap on which OCR is to be performed
        using (var page = engine.Process(image, PageSegMode.Auto))
        {
            using (var iterator = page.GetIterator())
            {
                iterator.Begin();
                do
                {
                    string currentWord = iterator.GetText(PageIteratorLevel.Word);
                    // TryGetBoundingBox returns false when no box is available
                    if (iterator.TryGetBoundingBox(PageIteratorLevel.Word, out Rect bounds))
                    {
                        // do something with currentWord and bounds
                    }
                }
                while (iterator.Next(PageIteratorLevel.Word));
            }
        }
    }

You can now store the bounds for each word and write your own logic to map them to table rows and columns based on their bounding boxes. This is the difficult part, but if your table format is neat, you should be able to get it working with some effort. Also, consider looking at the Tabula library to see if it solves the problem at hand.
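As a starting point for the mapping step, here is a minimal sketch of one possible approach: group words into rows by the vertical centre of their boxes, then order each row left to right. It is not part of the Tesseract wrapper. `WordBox`, `TableRows`, `GroupIntoRows` and the `tolerance` parameter are hypothetical names made up for illustration, and the sketch assumes the page is not skewed and that you have already copied each word's text and `Rect` coordinates into a `WordBox`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one OCR word and its bounding box (in pixels).
public record WordBox(string Text, int X1, int Y1, int X2, int Y2)
{
    // Vertical centre of the box, used to decide which row a word belongs to.
    public double CenterY => (Y1 + Y2) / 2.0;
}

public static class TableRows
{
    // Groups words into rows. Words are visited top to bottom; a word joins the
    // current row when its vertical centre is within `tolerance` pixels of the
    // centre of that row's first word, otherwise it starts a new row.
    public static List<List<WordBox>> GroupIntoRows(IEnumerable<WordBox> words, double tolerance)
    {
        var rows = new List<List<WordBox>>();
        foreach (var word in words.OrderBy(w => w.CenterY))
        {
            var current = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (current != null && Math.Abs(word.CenterY - current[0].CenterY) <= tolerance)
            {
                current.Add(word);
            }
            else
            {
                rows.Add(new List<WordBox> { word });
            }
        }

        // Order the words of each row from left to right.
        return rows.Select(r => r.OrderBy(w => w.X1).ToList()).ToList();
    }
}
```

A tolerance of roughly half the typical word height is a reasonable first guess, but it is an assumption you will need to tune for your scans. Columns can then be derived in a similar way, for example by clustering the `X1` values across rows.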

Upvotes: 2
