Reputation: 347
For my application I need to use OCR to extract text from invoices. To achieve this, I crop the invoice into its individual columns and run each cropped image through Tesseract. For the majority of the columns this works perfectly, but for a few it doesn't split the lines and outputs everything as a single string.
What I am currently trying is the string.Split() method with "\r\n", "\r" and "\n" as separators.
The code below shows exactly how I am attempting to split the output into an array of strings:
// Requires: using System; using System.Drawing; using Tesseract;
public string[] ProcessFile(Image InputImage)
{
    // The using blocks dispose the bitmap, engine and page even if OCR throws.
    using (Bitmap WorkImage = new Bitmap(InputImage))
    using (TesseractEngine Engine = new TesseractEngine("./tessdata", "eng", EngineMode.TesseractAndCube))
    using (Page RawOutput = Engine.Process(WorkImage))
    {
        string ConvertedOutput = RawOutput.GetText();
        // Split on Windows, old Mac and Unix line endings.
        return ConvertedOutput.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
    }
}
For columns that contain values like "product 1", "product 2", "product 3", etc., this works just fine, but when the column contains individual numbers, like "1", "4", "12", "6", it only returns "14126".
I hope someone is able to point me towards a solution to this. Many thanks in advance!
Upvotes: 1
Views: 3122
Reputation: 5655
Check the Tesseract documentation on GitHub: https://github.com/tesseract-ocr/tessdoc
You can use the PageSegmentationMode, PageSegMode.SingleBlock, to accomplish what you are looking for.
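As a rough sketch of how that could look, assuming the same .NET wrapper as in the question and assuming its Process method accepts an optional PageSegMode argument next to the Bitmap (worth checking against the version you have installed). The class and method names ColumnOcr and ProcessColumn are made up for illustration, and the engine settings are copied from the question:

```csharp
using System;
using System.Drawing;
using Tesseract;

public static class ColumnOcr
{
    // Hypothetical helper: OCRs one cropped column image and returns one entry per text line.
    public static string[] ProcessColumn(Image inputImage)
    {
        using (var workImage = new Bitmap(inputImage))
        using (var engine = new TesseractEngine("./tessdata", "eng", EngineMode.TesseractAndCube))
        // Assumption: this wrapper version exposes Process(Bitmap, PageSegMode?).
        // SingleBlock asks Tesseract to treat the image as one uniform block of text.
        using (var page = engine.Process(workImage, PageSegMode.SingleBlock))
        {
            // RemoveEmptyEntries drops blank entries, e.g. from trailing newlines in the OCR text.
            return page.GetText().Split(
                new[] { "\r\n", "\r", "\n" },
                StringSplitOptions.RemoveEmptyEntries);
        }
    }
}
```

Note that this uses StringSplitOptions.RemoveEmptyEntries instead of None, so blank lines in the OCR output are not returned; switch back to None if you need to keep row positions aligned across columns.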
Upvotes: 1