Reputation: 157
I am working on text recognition in documents, using the Google Cloud Vision library for .NET Core.
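using System;
using Google.Cloud.Vision.V1;

var client = ImageAnnotatorClient.Create();
ImageContext context = new ImageContext();
// Print the configured language hints (the list is empty at this point).
foreach (var hint in context.LanguageHints)
{
    Console.WriteLine("hint " + hint);
}
// fromBytes is the Image to recognize, loaded elsewhere (e.g. with Image.FromBytes).
var response = client.DetectDocumentText(fromBytes, context);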
The document contains the Russian phrase "Однажды вечером" ("One evening").
If I use Google DOCUMENT_TEXT_DETECTION without language hints, I get the result "ОДНАЖДЫ BELEPOM": the first word is correct, but the second one is wrong.
Okay, maybe it is unsure about the language, so let's state explicitly that it is Russian:
context.LanguageHints.Add("ru");
var response = client.DetectDocumentText(fromBytes, context);
The result is the same: the second word is still entirely in Latin letters.
Maybe I am passing the hint incorrectly? Let's try another language as an example:
context.LanguageHints.Add("en");
var response = client.DetectDocumentText(fromBytes, context);
result:
DAHAYKALI BELEPOM
As we can see, the hints mechanism itself works; it is just that the Russian word "ВЕЧЕРОМ" ("in the evening") is recognized with such low confidence that the engine falls back to English OCR.
The question is: how can I force-disable all OCR modules except the one for a single language (Russian in this case)? It should at least try to recognize Russian letters, even if with lower confidence.
Upvotes: 2
Views: 806
Reputation: 31
As per this link, it is not possible to restrict the language; the setting is only a hint.
A possible workaround, though I am not sure it will work, is to run the image through Tesseract with a restricted character set: https://stackoverflow.com/a/70345684/14980950
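A minimal sketch of that Tesseract idea, not tested against your image. It assumes the "Tesseract" NuGet wrapper, a local tessdata folder containing rus.traineddata, and a hypothetical file name document.png; whether the whitelist is honored depends on the Tesseract version (some 4.0 LSTM builds reportedly ignore it).

```csharp
using System;
using Tesseract; // assumption: the "Tesseract" NuGet wrapper package

class RussianOnlyOcr
{
    static void Main(string[] args)
    {
        // Hypothetical paths: tessdata must contain rus.traineddata.
        string tessdataPath = "./tessdata";
        string imagePath = args.Length > 0 ? args[0] : "document.png";

        // Load only the Russian model, so no English model is available as a fallback.
        using var engine = new TesseractEngine(tessdataPath, "rus", EngineMode.Default);

        // Restrict the output to Cyrillic letters, digits, and basic punctuation.
        const string cyrillic =
            "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ" +
            "абвгдеёжзийклмнопрстуфхцчшщъыьэюя";
        engine.SetVariable("tessedit_char_whitelist", cyrillic + "0123456789.,!?-");

        using var img = Pix.LoadFromFile(imagePath);
        using var page = engine.Process(img);

        Console.WriteLine(page.GetText());
        Console.WriteLine("mean confidence: " + page.GetMeanConfidence());
    }
}
```

If the whitelist has no effect, loading only the "rus" model should still keep the engine from switching to an English model, although the Russian model can still emit some Latin characters on its own.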
Upvotes: 0