Reputation: 1500
I've been reading a bunch of posts and stuff on bad outputs from Tesseract .Net wrapper with various image "types", but I couldn't figure out a solution to my bad output.
Here's the picture I'm trying to parse:
As you can see there are different fonts, sizes, foregrounds and backgrounds. I tried to grayscale it and upscale it by different amounts but nothing comes close to correctly parsing the whole image.
TesseractEngine ocr = new TesseractEngine(Path.Combine(Environment.CurrentDirectory, "tessdata"), "fra", EngineMode.Default);
ocr.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZÉÈ0123456789:'");
Page pg = ocr.Process(image.ToGrayscale().ScaleByPercent(200));
MessageBox.Show(pg.GetText());
With this code (let me know if the details of ToGrayScale()
and ScaleByPercent(...)
would help), here's the output I get:
8300 QÉMQ I09'0'9I
PIOII' :
Which seemingly corresponds to Bacc. génie logiciel
& Profil :
.
That being said, I know very little on image transformation so examples or hints would greatly help, but I'm totally willing to dig into linked stuff/documentation if necessary. How should I proceed to process such an image ?
EDIT: With some manips (suggested by @Yves Daoust) I've managed to reach this point:
However the output (on the right) isn't quite perfect yet. I've been struggling still to provide configs to the Tesseract so that it would only accept words from a certain list. Here's my attempt:
var initVars = new Dictionary<string, object>() {
{ "load_system_dawg", false },
{ "user_words_suffix", "fra.user-words" },
{ "language_model_penalty_non_freq_dict_word", 1 },
{ "language_model_penalty_non_dict_word", 1 }
};
TesseractEngine ocr = new TesseractEngine(Path.Combine(Environment.CurrentDirectory, "tessdata"), "fra", EngineMode.Default,
Enumerable.Empty<string>(), initVars, false);
I've been looking for examples on how to provide such configs but I've only found short, undetailed textual explanations.
Upvotes: 2
Views: 2890
Reputation:
You can help Tesseract to a great extent by extracting the characters yourself, which is pretty trivial here: retain only the pixels in white (and in other colors for other parts of the form).
By the way, the characters are so predictable that you could do the recognition yourself (by simple pixel-wise comparison), without the help of Tesseract.
Upvotes: 3