Reputation: 85
I am developing an application to automate invoice indexing. One use case is extracting tables from scanned documents. To do this, I need the coordinates of every word in the text (or, if that is not possible, the coordinates of each letter). I plan to use Tesseract 4.0 from C#. Is this possible?
Thank you
Upvotes: 1
Views: 4158
Reputation: 1
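' page, listOfSearch, pdfPage and HighlightTextOnImage are assumed to be defined elsewhere.
Dim Region As Rect
Using iterator = page.GetIterator()
    iterator.Begin()
    Do
        ' Text of the word the iterator currently points at
        Dim searchStr = iterator.GetText(PageIteratorLevel.Word)
        ' Guard against a missing result before searching it
        If searchStr IsNot Nothing Then
            For Each k As String In listOfSearch
                If searchStr.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0 Then
                    ' Region receives the word's bounding box in image pixels
                    If iterator.TryGetBoundingBox(PageIteratorLevel.Word, Region) Then
                        Dim systemDrawingRect As New System.Drawing.Rectangle(Region.X1, Region.Y1, Region.Width, Region.Height)
                        pdfPage = HighlightTextOnImage(pdfPage, systemDrawingRect)
                    End If
                End If
            Next
        End If
    Loop While iterator.[Next](PageIteratorLevel.Word)
End Using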
Upvotes: 0
Reputation: 987
You can get the bounding box of each recognized word. Below is sample code using the C# Tesseract wrapper.
using Tesseract;

// Initialize the TesseractEngine
using (var engine = new TesseractEngine("path to tessdata folder", "eng", EngineMode.Default))
{
    // image here is the Bitmap on which OCR is to be performed
    using (var page = engine.Process(image, PageSegMode.Auto))
    {
        using (var iterator = page.GetIterator())
        {
            iterator.Begin();
            do
            {
                string currentWord = iterator.GetText(PageIteratorLevel.Word);
                // TryGetBoundingBox returns false when no box is available for this word
                if (iterator.TryGetBoundingBox(PageIteratorLevel.Word, out Rect bounds))
                {
                    // do something with currentWord and bounds (X1, Y1, X2, Y2 in image pixels)
                }
            }
            while (iterator.Next(PageIteratorLevel.Word));
        }
    }
}
You can now store the bounds for each word and write your own logic to map them to table rows and columns based on their bounding boxes. This is the difficult part, but if your table format is neat, you should be able to get it working with some effort. Also consider looking at the Tabula library to see whether it solves the problem at hand.
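As a rough illustration of the row-mapping step, here is a minimal sketch that groups word boxes into rows by comparing vertical centers. It is only one possible heuristic, not part of the Tesseract wrapper: WordBox, TableRows, GroupIntoRows and the tolerance parameter are hypothetical names introduced for this example, and the sketch assumes the scan is not skewed. You would fill WordBox from currentWord and bounds in the loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one OCR word and its bounding box in image pixels.
public sealed class WordBox
{
    public string Text { get; }
    public int X1 { get; }
    public int Y1 { get; }
    public int X2 { get; }
    public int Y2 { get; }

    public WordBox(string text, int x1, int y1, int x2, int y2)
    {
        Text = text; X1 = x1; Y1 = y1; X2 = x2; Y2 = y2;
    }

    public double CenterY => (Y1 + Y2) / 2.0;
    public int Height => Y2 - Y1;
}

public static class TableRows
{
    // Groups words into rows, top to bottom. A word joins the current row when its
    // vertical center is within tolerance * (its own height) of the row's average center.
    // The default tolerance of 0.5 is an assumption and will likely need tuning.
    public static List<List<WordBox>> GroupIntoRows(IEnumerable<WordBox> words, double tolerance = 0.5)
    {
        var rows = new List<List<WordBox>>();
        foreach (var word in words.OrderBy(w => w.CenterY))
        {
            var current = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (current != null &&
                Math.Abs(word.CenterY - current.Average(w => w.CenterY)) <= tolerance * word.Height)
            {
                current.Add(word);
            }
            else
            {
                rows.Add(new List<WordBox> { word });
            }
        }
        // Within each row, order words left to right so they read like table cells.
        return rows.Select(r => r.OrderBy(w => w.X1).ToList()).ToList();
    }
}
```

Columns could then be derived in a similar way, for example by clustering the X1 values across rows, but that depends heavily on how regular the invoice layout is.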
Upvotes: 2