Melvin Winthagen

Reputation: 347

Tesseract OCR not splitting lines correctly

For my application I need to use OCR to extract text from invoices. To do this, I crop the invoice into its individual columns and run each cropped image through Tesseract. For most columns this works perfectly, but for a few it doesn't split the lines and returns everything in a single string.

What I am currently trying is to split the output with the string.Split() method, using "\r\n", "\r" and "\n" as separators.

The code below shows how exactly I am attempting to split the output into an array of strings:

public string[] ProcessFile(Image InputImage)
{
    // The using statements dispose the bitmap, engine and page even if OCR throws.
    using (Bitmap WorkImage = new Bitmap(InputImage))
    using (TesseractEngine Engine = new TesseractEngine("./tessdata", "eng", EngineMode.TesseractAndCube))
    using (Page RawOutput = Engine.Process(WorkImage))
    {
        string ConvertedOutput = RawOutput.GetText();
        // Split on every newline style the OCR output might contain.
        return ConvertedOutput.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
    }
}

For columns that contain values like "product 1", "product 2", "product 3", etc., this works just fine. But when the column contains only individual numbers, such as "1", "4", "12", "6", it returns them as one string: "14126".

I hope someone can point me towards a solution. Many thanks in advance!

Upvotes: 1

Views: 3122

Answers (1)

Youp Bernoulli

Reputation: 5655

Check the Tesseract documentation on GitHub: https://github.com/tesseract-ocr/tessdoc

You can set the page segmentation mode to PageSegMode.SingleBlock to accomplish what you are looking for. This mode tells Tesseract to treat the image as a single uniform block of text.
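
A minimal sketch of how that mode might be passed to the engine, reusing the engine settings from the question. It assumes a 3.x release of the Tesseract .NET wrapper, where Process accepts a Bitmap plus an optional PageSegMode; newer releases may require converting the image to a Pix first. The ColumnOcr class and its SplitLines and ProcessColumn methods are hypothetical names used only for illustration, and dropping empty entries is a choice made here, not something the question requires.

```csharp
using System;
using System.Drawing;
using Tesseract;

public static class ColumnOcr
{
    // Hypothetical helper: splits OCR text on any newline style and
    // drops empty entries (for example, those from trailing newlines).
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Assumption: the wrapper version in use exposes
    // Process(Bitmap, PageSegMode?), as the 3.x releases do.
    public static string[] ProcessColumn(Bitmap columnImage)
    {
        using (var engine = new TesseractEngine("./tessdata", "eng", EngineMode.TesseractAndCube))
        // SingleBlock asks Tesseract to treat the crop as one uniform block of text.
        using (var page = engine.Process(columnImage, PageSegMode.SingleBlock))
        {
            return SplitLines(page.GetText());
        }
    }
}
```

If SingleBlock still merges the digits for a particular column, the other PageSegMode values listed in the documentation linked above are worth trying on the same crop.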

Upvotes: 1
