Michael Liu

Reputation: 55389

Length of substring matched by culture-sensitive String.IndexOf method

I tried writing a culture-aware string replacement method:

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    return index >= 0
        ? text.Substring(0, index) + newValue + text.Substring(index + oldValue.Length)
        : text;
}

However, it chokes on Unicode combining characters:

// \u0301 is Combining Acute Accent
Console.WriteLine(Replace("déf", "é", "o"));       // 1. CORRECT: dof
Console.WriteLine(Replace("déf", "e\u0301", "o")); // 2. INCORRECT: do
Console.WriteLine(Replace("de\u0301f", "é", "o")); // 3. INCORRECT: dóf

To fix my code, I need to know that in the second example, String.IndexOf matched only one character (é) even though it searched for two (e\u0301). Similarly, I need to know that in the third example, String.IndexOf matched two characters (e\u0301) even though it only searched for one (é).

How can I determine the actual length of the substring matched by String.IndexOf?

NOTE: Performing Unicode normalization on text and oldValue (as suggested by James Keesey) would accommodate combining characters, but ligatures would still be a problem:

Console.WriteLine(Replace("œf", "œ", "i"));  // 4. CORRECT: if
Console.WriteLine(Replace("œf", "oe", "i")); // 5. INCORRECT: i
Console.WriteLine(Replace("oef", "œ", "i")); // 6. INCORRECT: ief

Upvotes: 14

Views: 1006

Answers (4)

Michael Liu

Reputation: 55389

As of .NET 5, the CompareInfo.IndexOf method has an overload that returns the number of matched characters via an out parameter:

public int IndexOf(
    ReadOnlySpan<char> source, ReadOnlySpan<char> value, CompareOptions options,
    out int matchLength);

So the culture-aware string replacement method can be rewritten like this:

public static string Replace(string text, string oldValue, string newValue)
{
    // CompareOptions.None keeps the comparison case-sensitive, matching
    // StringComparison.CurrentCulture in the original method.
    int index = CultureInfo.CurrentCulture.CompareInfo.IndexOf(
        text, oldValue, CompareOptions.None, out int matchLength);
    return index >= 0
        ? text.Substring(0, index) + newValue + text.Substring(index + matchLength)
        : text;
}

Here is the result:

// \u0301 is Combining Acute Accent
Console.WriteLine(Replace("déf", "é", "o"));       // 1. CORRECT: dof
Console.WriteLine(Replace("déf", "e\u0301", "o")); // 2. CORRECT: dof
Console.WriteLine(Replace("de\u0301f", "é", "o")); // 3. CORRECT: dof

Also as of .NET 5, the ligature "œ" no longer matches "oe" in the en-US culture because the default globalization library on Windows switched from NLS to ICU. Reverting to NLS allows examples 4-6 to work properly:

Console.WriteLine(Replace("œf", "œ", "i"));  // 4. CORRECT: if
Console.WriteLine(Replace("œf", "oe", "i")); // 5. CORRECT: if
Console.WriteLine(Replace("oef", "œ", "i")); // 6. CORRECT: if
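
For reference, one documented way to opt back into NLS on Windows is the System.Globalization.UseNls runtime setting. A minimal runtimeconfig.json fragment might look like this (a sketch only; the rest of the file is omitted and your project may generate it differently):

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.Globalization.UseNls": true
    }
  }
}
```

The same switch can also be set with a RuntimeHostConfigurationOption item in the project file or with the DOTNET_SYSTEM_GLOBALIZATION_USENLS environment variable. It only has an effect on Windows.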

Upvotes: 0

Tim S.

Reputation: 56536

The following methods work for your examples. LengthInString compares progressively longer substrings of text, starting at index, against oldValue until it finds one that is equal under the current culture. Replace then uses that length instead of simply oldValue.Length.

public static string Replace(string text, string oldValue, string newValue)
{
    int index = text.IndexOf(oldValue, StringComparison.CurrentCulture);
    if (index >= 0)
        return text.Substring(0, index) + newValue +
                 text.Substring(index + LengthInString(text, oldValue, index));
    else
        return text;
}
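
static int LengthInString(string text, string oldValue, int index)
{
    // Grow the candidate substring one char at a time until it compares
    // equal to oldValue under the current culture.
    for (int length = 1; length <= text.Length - index; length++)
        if (string.Equals(text.Substring(index, length), oldValue,
                                            StringComparison.CurrentCulture))
            return length;
    // Should be unreachable when IndexOf has already reported a match at index.
    throw new InvalidOperationException("No substring at the given index equals oldValue.");
}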

Upvotes: 2

David Ewen

Reputation: 3732

You will need to directly call FindNLSString or FindNLSStringEx yourself. String.IndexOf uses FindNLSStringEx but all the information you need is available in FindNLSString.

Here is an example of how to rewrite your Replace method so that it works against your test cases. Note that I am using the current user locale; read the API documentation if you want to use the system locale or provide your own. I am also passing 0 for the flags, which means it will use the default string comparison options for the locale. Again, the documentation can help you provide different options.

public const int LOCALE_USER_DEFAULT = 0x0400;

[DllImport("kernel32.dll", SetLastError = true, ExactSpelling = true)]
internal static extern int FindNLSString(int locale, uint flags, [MarshalAs(UnmanagedType.LPWStr)] string sourceString, int sourceCount, [MarshalAs(UnmanagedType.LPWStr)] string findString, int findCount, out int found);

public static string ReplaceWithCombiningCharSupport(string text, string oldValue, string newValue)
{
    int foundLength;
    int index = FindNLSString(LOCALE_USER_DEFAULT, 0, text, text.Length, oldValue, oldValue.Length, out foundLength);
    return index >= 0 ? text.Substring(0, index) + newValue + text.Substring(index + foundLength) : text;
}

Upvotes: 5

James Keesey

Reputation: 1217

I spoke too soon (and had never seen this method before) but there is an alternative. You can use the StringInfo.ParseCombiningCharacters() method to get the start of each actual character and use that to determine the length of the string to replace.
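
As a rough sketch of that idea: StringInfo.ParseCombiningCharacters returns the start index of every text element, so the length of the element that begins at a given index is the distance to the next start. The class and method names below are hypothetical, and this only covers a match that is exactly one text element; it does not help with ligatures such as "œ" versus "oe".

```csharp
using System;
using System.Globalization;

public static class TextElementBoundaries
{
    // Hypothetical helper: returns the length, in UTF-16 code units, of the
    // text element (base character plus combining marks) starting at 'index'.
    public static int TextElementLengthAt(string text, int index)
    {
        // Start indexes of all text elements, e.g. { 0, 1, 3 } for "de\u0301f".
        int[] starts = StringInfo.ParseCombiningCharacters(text);

        int pos = Array.IndexOf(starts, index);
        if (pos < 0)
            throw new ArgumentException(
                "index is not the start of a text element.", nameof(index));

        // The element ends where the next one starts, or at the end of the string.
        int end = pos + 1 < starts.Length ? starts[pos + 1] : text.Length;
        return end - index;
    }
}
```

With the index returned by IndexOf, TextElementLengthAt("de\u0301f", 1) gives 2, while TextElementLengthAt("déf", 1) with a precomposed é gives 1.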


You will need to normalize both strings before you make the IndexOf call. This makes sure that canonically equivalent sequences (such as é and e\u0301) have the same representation, and therefore the same length, in the source and target strings.

See the String.Normalize() reference page, which describes this exact problem.
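
A minimal sketch of the normalization approach, assuming Form C (canonical composition) is acceptable for your data. The class name is hypothetical. Note that when a match is found, the returned string is built from the normalized text, so unrelated parts of the input may also change representation. As the question points out, this does not address ligatures.

```csharp
using System;
using System.Text;

public static class NormalizingReplace
{
    public static string Replace(string text, string oldValue, string newValue)
    {
        // Assumption: Form C is used, so "e\u0301" and "é" both become "é".
        string normalizedText = text.Normalize(NormalizationForm.FormC);
        string normalizedOld = oldValue.Normalize(NormalizationForm.FormC);

        int index = normalizedText.IndexOf(normalizedOld, StringComparison.CurrentCulture);

        // After normalization, combining-character matches have the same
        // length as normalizedOld, so its Length can be used for the cut.
        return index >= 0
            ? normalizedText.Substring(0, index) + newValue
                + normalizedText.Substring(index + normalizedOld.Length)
            : text;
    }
}
```

With this version, examples 1-3 from the question should all produce dof.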

Upvotes: 2
