Potion

Reputation: 157

Google Cloud Vision DOCUMENT_TEXT_DETECTION language hints: how can I force it to use only one language?

I work with text recognition on documents, using the Google Cloud Vision client library for .NET Core:

    var client = ImageAnnotatorClient.Create();
    ImageContext context = new ImageContext();
    // Print the current language hints (the list is empty for a new ImageContext)
    foreach (var hint in context.LanguageHints)
    {
        Console.WriteLine("hint " + hint);
    }
    var response = client.DetectDocumentText(fromBytes, context);

In Russian there is a phrase "Однажды вечером" ("One evening").

If I use Google DOCUMENT_TEXT_DETECTION without language hints, I get the result "ОДНАЖДЫ BELEPOM" (the first word is correct, the second is wrong).

Okay, maybe it is unsure of the language, so let's state explicitly that it is Russian:

    context.LanguageHints.Add("ru");
    var response = client.DetectDocumentText(fromBytes, context);

The result is the same: the second word is still entirely in Latin letters.

Maybe I am specifying the hint incorrectly? Let's try another language as an example:

    context.LanguageHints.Add("en");
    var response = client.DetectDocumentText(fromBytes, context);

result:

DAHAYKALI BELEPOM

As we can see, the hints functionality itself works. It seems that the Russian word "ВЕЧЕРОМ" ("in the evening") is recognized with such low confidence that the engine falls back to the English OCR result.
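One way to check the low-confidence assumption is to print the confidence and detected languages that the response reports for each word. The following is a minimal sketch, assuming the `Google.Cloud.Vision.V1` package with default credentials configured; the file name `page.png` and the class name are placeholders, not part of the original code.

```csharp
using System;
using System.Linq;
using Google.Cloud.Vision.V1;

class WordConfidenceDump
{
    static void Main()
    {
        // Assumption: "page.png" is a placeholder for the scanned document.
        Image image = Image.FromFile("page.png");

        var context = new ImageContext();
        context.LanguageHints.Add("ru");

        ImageAnnotatorClient client = ImageAnnotatorClient.Create();
        TextAnnotation annotation = client.DetectDocumentText(image, context);
        if (annotation == null)
        {
            Console.WriteLine("No text detected.");
            return;
        }

        // Walk the hierarchy: page -> block -> paragraph -> word.
        foreach (var page in annotation.Pages)
        foreach (var block in page.Blocks)
        foreach (var paragraph in block.Paragraphs)
        foreach (var word in paragraph.Words)
        {
            // A word is stored as a sequence of symbols; join them back into text.
            string text = string.Concat(word.Symbols.Select(s => s.Text));

            // Property can be missing, so fall back to an empty list.
            var languages = word.Property?.DetectedLanguages
                .Select(l => $"{l.LanguageCode} ({l.Confidence:F2})")
                ?? Enumerable.Empty<string>();

            Console.WriteLine(
                $"{text}\tconfidence={word.Confidence:F2}\tlanguages=[{string.Join(", ", languages)}]");
        }
    }
}
```

If the second word is reported with a non-Russian language code or a noticeably lower confidence than the first, that would support the fallback explanation; the sketch only reports what the API returns and does not change the recognition itself.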

The question is: how do I forcibly disable all OCR modules except the one for a single language (Russian in this case)? Let it at least try to recognize Russian letters, even if with lower confidence.

Upvotes: 2

Views: 806

Answers (2)

Data Cyclist

Reputation: 31

According to this link, it is not possible to restrict the language; the setting is only a hint.

A possible solution, though I am not sure it will work, is to use Tesseract with a restricted character set and run this text through it: https://stackoverflow.com/a/70345684/14980950
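A rough sketch of that idea, assuming the `Tesseract` NuGet wrapper, a `./tessdata` folder containing `rus.traineddata`, and a placeholder image `page.png` (none of these come from the question). It loads only the Russian model and additionally whitelists Russian letters. Whether the whitelist is honored can depend on the Tesseract version and engine mode, so treat this as something to try rather than a guaranteed fix.

```csharp
using System;
using System.Text;
using Tesseract;

class RussianOnlyOcr
{
    static void Main()
    {
        // Build a whitelist of Russian letters: U+0410..U+044F (А..я) plus Ё and ё.
        var whitelist = new StringBuilder();
        for (char c = '\u0410'; c <= '\u044F'; c++)
        {
            whitelist.Append(c);
        }
        whitelist.Append('\u0401').Append('\u0451');

        // Assumptions: "./tessdata" holds rus.traineddata; "page.png" is a placeholder.
        using (var engine = new TesseractEngine("./tessdata", "rus", EngineMode.Default))
        {
            // Restrict recognition to the whitelisted characters.
            engine.SetVariable("tessedit_char_whitelist", whitelist.ToString());

            using (var img = Pix.LoadFromFile("page.png"))
            using (var page = engine.Process(img))
            {
                Console.WriteLine(page.GetText());
                Console.WriteLine($"mean confidence: {page.GetMeanConfidence():F2}");
            }
        }
    }
}
```

Note that this whitelist also excludes digits and punctuation, so it would need to be extended for documents that contain them.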

Upvotes: 0

InUser

Reputation: 1137

Compare your request to the one in the API: there you receive the correct value, "Однажды вечером" (scroll down to see the request JSON).

Upvotes: 0
