Tesseract OCR configurations and image manipulations

Question

I've read many posts about poor output from the Tesseract .NET wrapper with various kinds of images, but I couldn't find a solution to my own bad output.

Here's the picture I'm trying to parse:

As you can see, there are different fonts, sizes, foregrounds and backgrounds. I tried converting it to grayscale and upscaling it by different amounts, but nothing comes close to correctly parsing the whole image.

TesseractEngine ocr = new TesseractEngine(Path.Combine(Environment.CurrentDirectory, "tessdata"), "fra", EngineMode.Default);
ocr.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZÉÈ0123456789:'");
Page pg = ocr.Process(image.ToGrayscale().ScaleByPercent(200));
MessageBox.Show(pg.GetText());

With this code (let me know if the details of ToGrayscale() and ScaleByPercent(...) would help), here's the output I get:

8300 QÉMQ I09'0'9I

PIOII' :

This seemingly corresponds to "Bacc. génie logiciel" and "Profil :".

That being said, I know very little about image transformation, so examples or hints would help greatly, but I'm entirely willing to dig into linked material or documentation if necessary. How should I proceed to process such an image?

EDIT: With some manipulations (suggested by @Yves Daoust), I've managed to reach this point:

However, the output (on the right) isn't quite perfect yet. I've still been struggling to configure Tesseract so that it only accepts words from a certain list. Here's my attempt:

var initVars = new Dictionary<string, object>() {
            { "load_system_dawg", false },
            { "user_words_suffix", "fra.user-words" },
            { "language_model_penalty_non_freq_dict_word", 1 },
            { "language_model_penalty_non_dict_word", 1 }
        };
TesseractEngine ocr = new TesseractEngine(Path.Combine(Environment.CurrentDirectory, "tessdata"), "fra", EngineMode.Default, 
            Enumerable.Empty<string>(), initVars, false);

I've been looking for examples of how to provide such configs, but I've only found short textual explanations without much detail.
