Ben Walker
Ben Walker

Reputation: 2137

Regex or string compare with allowance of error

I'm trying to do a string compare in C# with some allowance for error. For example, if my search term is "Welcome", but if my comparison string (generated through OCR) is "We1come" and my error allowance is 20%, that should match. That part isn't so difficult using something like the Levenshtein algorithm. The hard part is making it work within a larger block of text, like a regular expression. For example, maybe my OCR result is "Hello. My name is Ben. We1come to my StackOverflow question.", I would like to pick out that We1come as a good result compared to my search term.

Upvotes: 2

Views: 280

Answers (2)

Milne
Milne

Reputation: 848

Took quite a while, but this works well. Fun problem :)

string PossibleString = PossibleString.ToString().ToLower();
string StaticText = StaticText.ToLower();
decimal PossibleStringLength = (PossibleString.Length);
decimal StaticTextLength = (StaticText.Length);
decimal NumberOfErrorsAllowed = Math.Round((StaticTextLength * (ErrorAllowance / 100)), MidpointRounding.AwayFromZero);
int LevenshteinDistance = LevenshteinAlgorithm(StaticText, PossibleString);
string PossibleResult = string.Empty;

if (LevenshteinDistance == PossibleStringLength - StaticTextLength)
{
    // Perfect match. no need to calculate.
    PossibleResult = StaticText;
}
else
{
    int TextLengthBuffer = (int)StaticTextLength - 1;
    int LowestLevenshteinNumber = 999999;

    for (int i = 0; i < 3; i++) // Check for best results with same amount of characters as expected, as well as +/- 1
    {
        for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
        {
            string possibleResult = (PossibleString.Substring((e - TextLengthBuffer), TextLengthBuffer));
            int lAllowance = (int)(Math.Round((possibleResult.Length - StaticTextLength) + (NumberOfErrorsAllowed), MidpointRounding.AwayFromZero));
            int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);

            if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
            {
                PossibleResult = possibleResult;
                LowestLevenshteinNumber = lNumber;
            }
        }
        TextLengthBuffer++;
    }
}


public static int LevenshteinAlgorithm(string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

    for (int i = 0; i <= n; d[i, 0] = i++)
    {
    }

    for (int j = 0; j <= m; d[0, j] = j++)
    {
    }

    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}

Upvotes: 1

Stephan
Stephan

Reputation: 43053

If it is somehow predictable how the OCR can miss letters, I would replace the letters in the search with misses.

If the search is Welcome, the regex would be (?i)We[l1]come.

Upvotes: 0

Related Questions