Reputation: 26336

Substring UTF-8 for both Latin, Chinese, Cyrillic, etc

On Windows Phone, I want to truncate any given string to the equivalent of 100 ASCII characters in length, that is, 100 bytes once encoded.

String.Length is obviously useless, as the encoded size per character varies: in UTF-8 a Chinese character typically uses 3 bytes, Danish text uses 1 or 2 bytes per character, and a Russian character uses 2 bytes.

The only available encodings are UTF-8 and UTF-16. So what do I do?

The idea is this:

private static string UnicodeSubstring(string text, int length)
{
    var bytes = Encoding.UTF8.GetBytes(text);

    return Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, length));
}

But the cut has to fall on a character boundary rather than in the middle of a multi-byte sequence, so that the last character is always rendered correctly.

Upvotes: 6

Answers (4)

John Zabroski

Reputation: 2357

While this is an extremely old question, I believe the right approach is to use the StringInfo.SubstringByTextElements method of the System.Globalization.StringInfo class. Its major advantage is a documented guarantee: according to StringInfo's Notes to Callers, character classification conforms to The Unicode Standard, Version 8.0.0 on .NET Framework 4.6.2 and later and on .NET Core:

Notes to Callers

Internally, the methods of the StringInfo class call the methods of the CharUnicodeInfo class to determine character categories. Starting with the .NET Framework 4.6.2, character classification is based on The Unicode Standard, Version 8.0.0. For the .NET Framework 4 through the .NET Framework 4.6.1, it is based on The Unicode Standard, Version 6.3.0. In .NET Core, it is based on The Unicode Standard, Version 8.0.0.

Now, how do you actually call SubstringByTextElements, given there are no examples on Microsoft Docs on how to call it?

In the StringInfo class, there is a note that says:

By calling the ParseCombiningCharacters method to retrieve an array that contains the starting index of each text element. You can then retrieve individual text elements by passing these indexes to the SubstringByTextElements method.

So:

1. Call ParseCombiningCharacters to get the starting index of each text element.
2. Call SubstringByTextElements using the indexes provided by step one.
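
The two steps above can be sketched roughly as follows. This is only an illustration: the class name TextElementDemo and the helper FirstTextElements are hypothetical, and I have not checked which of these APIs are available on Windows Phone. Note that SubstringByTextElements takes positions counted in text elements, so the array returned by ParseCombiningCharacters is used here only to find out how many text elements the string contains.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: keeps at most maxElements text elements of text.
    public static string FirstTextElements(string text, int maxElements)
    {
        if (string.IsNullOrEmpty(text) || maxElements <= 0)
        {
            return string.Empty;
        }

        // Step 1: starting char index of each text element.
        int[] starts = StringInfo.ParseCombiningCharacters(text);
        if (starts.Length <= maxElements)
        {
            // Nothing to cut: the string already fits.
            return text;
        }

        // Step 2: SubstringByTextElements counts in text elements, not chars,
        // so a combining sequence or surrogate pair is never split.
        return new StringInfo(text).SubstringByTextElements(0, maxElements);
    }
}
```

This limits the string by text-element count, not by bytes. For a byte limit you would still need to measure each element, for example with Encoding.UTF8.GetByteCount.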

Upvotes: 1

Jon Skeet

Reputation: 1502246

One option is to simply go through the string, computing the number of bytes for each character.

If you know you don't need to deal with characters outside the BMP, this is reasonably simple:

public string SubstringWithinUtf8Limit(string text, int byteLimit)
{
    int byteCount = 0;
    char[] buffer = new char[1];
    for (int i = 0; i < text.Length; i++)
    {
        buffer[0] = text[i];
        byteCount += Encoding.UTF8.GetByteCount(buffer);
        if (byteCount > byteLimit)
        {
            // Couldn't add this character. Return everything before it
            return text.Substring(0, i);
        }
    }
    return text;
}

It becomes slightly trickier if you have to handle surrogate pairs :(
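
A rough sketch of what a surrogate-aware variant might look like, assuming a valid high/low surrogate pair should be kept or dropped as a single unit. The class name Utf8Truncation is made up for the example, and a lone (unpaired) surrogate is simply measured as whatever Encoding.UTF8 would emit for it.

```csharp
using System;
using System.Text;

public static class Utf8Truncation
{
    // Hypothetical surrogate-aware variant of the loop above.
    public static string SubstringWithinUtf8Limit(string text, int byteLimit)
    {
        int byteCount = 0;
        int i = 0;
        while (i < text.Length)
        {
            // A valid surrogate pair spans two chars and encodes to 4 bytes.
            int charCount = char.IsSurrogatePair(text, i) ? 2 : 1;
            byteCount += Encoding.UTF8.GetByteCount(text.Substring(i, charCount));
            if (byteCount > byteLimit)
            {
                // This unit does not fit. Return everything before it.
                return text.Substring(0, i);
            }
            i += charCount;
        }
        return text;
    }
}
```

This still splits combining sequences (a base letter followed by a combining accent), since it only looks at code points and not at text elements.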

Upvotes: 7

Claus Jørgensen

Reputation: 26336

Another idea is to check whether the last character is the Unicode replacement character (U+FFFD), and to remove trailing characters until the string renders correctly.

private static string UnicodeSubstring(string text, int length)
{
    var bytes = Encoding.UTF8.GetBytes(text);
    var result = Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, length));

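    // Drop trailing U+FFFD produced by a cut-off multi-byte sequence;
    // the length check avoids indexing into an empty string.
    while (result.Length > 0 && result[result.Length - 1] == '\uFFFD')
    {
        result = result.Substring(0, result.Length - 1);
    }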

    return result;
}

Upvotes: 0

Alexei Levenkov

Reputation: 100545

One option is to simply add "characters" (including surrogate pairs if you need to support them) to the resulting string one at a time and check whether the result still converts to no more than the number of bytes you want.
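
As a sketch of that idea, here is one way it could look if "character" is taken to mean a text element as reported by StringInfo, and the budget is measured in UTF-8 bytes. Both of those choices are assumptions on my part, and the names ByteBudgetTruncation and TruncateToUtf8Bytes are made up for the example.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class ByteBudgetTruncation
{
    // Hypothetical helper: appends whole text elements while they fit
    // within byteLimit UTF-8 bytes.
    public static string TruncateToUtf8Bytes(string text, int byteLimit)
    {
        var result = new StringBuilder();
        int byteCount = 0;

        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            int elementBytes = Encoding.UTF8.GetByteCount(element);
            if (byteCount + elementBytes > byteLimit)
            {
                // The next element would exceed the budget, so stop here.
                break;
            }
            result.Append(element);
            byteCount += elementBytes;
        }

        return result.ToString();
    }
}
```

Because whole text elements are appended, neither surrogate pairs nor base-plus-combining-mark sequences are cut in half.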

Upvotes: 1
