Tesseract OCR not splitting lines correctly

Question

For my application I need to use OCR to extract text from invoices. To achieve this, I crop the invoice image into its individual columns and run each cropped image through Tesseract. For the majority of the columns this works perfectly, but for a few it doesn't split the lines and outputs everything as a single string.

What I am currently trying is to split the output with the String.Split() method, using newline characters such as "\r\n" and "\n" as separators.

The code below shows exactly how I am attempting to split the output into an array of strings:

public string[] ProcessFile(Image InputImage)
{
    Bitmap WorkImage = new Bitmap(InputImage);
    string[] Output;

    Tesseract.TesseractEngine Engine = new TesseractEngine("./tessdata", "eng", EngineMode.TesseractAndCube);
    Page RawOutput = Engine.Process(WorkImage);
    string ConvertedOutput = RawOutput.GetText();
    // Split on Windows (\r\n), old Mac (\r) and Unix (\n) line endings
    Output = ConvertedOutput.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
    Engine.Dispose();
    return Output;
}

For columns that contain values like "product 1", "product 2", "product 3" and so on, this works just fine. But when the column contains individual numbers on separate lines, such as "1", "4", "12", "6", it only returns "14126".

I hope someone can point me towards a solution. Many thanks in advance!

Youp Bernoulli · Accepted Answer

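Check the Tesseract documentation on GitHub: https://github.com/tesseract-ocr/tessdoc

You can set the page segmentation mode to PageSegMode.SingleBlock to accomplish what you are looking for. This mode tells Tesseract to treat the image as a single uniform block of text, so each row of a cropped column should be recognized as its own line instead of being merged.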
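As a rough sketch of how that could look in the question's code: the version below assumes the same .NET Tesseract wrapper as in the question, in a version where Process accepts a Bitmap plus an optional PageSegMode argument. The class name ColumnOcr and the method name ProcessColumn are made up for illustration, and the engine settings are simply copied from the question.

```csharp
using System;
using System.Drawing;
using Tesseract;

public static class ColumnOcr   // hypothetical helper class, not part of the wrapper
{
    public static string[] ProcessColumn(Image inputImage)
    {
        // Engine arguments are taken unchanged from the question.
        using (var workImage = new Bitmap(inputImage))
        using (var engine = new TesseractEngine("./tessdata", "eng", EngineMode.TesseractAndCube))
        // Assumption: this wrapper version has a Process(Bitmap, PageSegMode?) overload.
        // SingleBlock asks Tesseract to treat the image as one uniform block of text.
        using (Page page = engine.Process(workImage, PageSegMode.SingleBlock))
        {
            string text = page.GetText();

            // Split on any line ending and drop the empty entries
            // that trailing newlines would otherwise produce.
            return text.Split(new[] { "\r\n", "\r", "\n" },
                              StringSplitOptions.RemoveEmptyEntries);
        }
    }
}
```

The using blocks replace the manual Engine.Dispose() call, so the page and engine are released even if recognition throws. If SingleBlock still merges the digits for a particular column, trying one of the other PageSegMode values is a reasonable next step.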
