Reputation: 26336
On Windows Phone, I want to truncate any given string to the equivalent of 100 ASCII characters in length.
String.Length is obviously useless, because it counts UTF-16 code units rather than bytes: in UTF-8, a Chinese string uses 3 bytes per character, a Danish string uses 1 or 2 bytes per character, and a Russian string uses 2 bytes per character.
The only available encodings are UTF-8 and UTF-16. So what do I do?
The idea is this:
    private static string UnicodeSubstring(string text, int length)
    {
        var bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, length));
    }
But the cut has to fall on a character boundary, not in the middle of a character's byte sequence, so that the last character is always rendered correctly.
Upvotes: 6
Views: 3499
Reputation: 2357
While this is an extremely old question, I believe the right approach is to use the System.Globalization.StringInfo class's SubstringByTextElements method. The major advantage is that the Notes to Callers in the StringInfo documentation guarantee conformance to the Unicode Standard, Version 8.0.0, for .NET Framework 4.6.2 and up:
Notes to Callers
Internally, the methods of the StringInfo class call the methods of the CharUnicodeInfo class to determine character categories. Starting with the .NET Framework 4.6.2, character classification is based on The Unicode Standard, Version 8.0.0. For the .NET Framework 4 through the .NET Framework 4.6.1, it is based on The Unicode Standard, Version 6.3.0. In .NET Core, it is based on The Unicode Standard, Version 8.0.0.
Now, how do you actually call SubstringByTextElements, given there are no examples on Microsoft Docs on how to call it?
In the StringInfo class, there is a note that says:
- By calling the ParseCombiningCharacters method to retrieve an array that contains the starting index of each text element. You can then retrieve individual text elements by passing these indexes to the SubstringByTextElements method.
So:
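A minimal sketch of calling it, assuming you simply want the first N text elements. The helper name `TakeTextElements` is hypothetical. `SubstringByTextElements` is an instance method that takes a starting text element index and a count of text elements. Note that this limits the number of text elements, not the number of bytes.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: returns at most maxElements text elements
    // (base character plus combining marks, or a surrogate pair) from the start of text.
    public static string TakeTextElements(string text, int maxElements)
    {
        if (string.IsNullOrEmpty(text) || maxElements <= 0)
        {
            return string.Empty;
        }

        var info = new StringInfo(text);

        // Clamp the count so we never ask for more elements than exist.
        int count = Math.Min(maxElements, info.LengthInTextElements);
        return info.SubstringByTextElements(0, count);
    }
}
```

For example, `TextElementDemo.TakeTextElements("e\u0301abc", 2)` keeps the `e` together with its combining accent and then takes `a`, rather than cutting between the letter and the accent.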
Upvotes: 1
Reputation: 1502246
One option is to simply go through the string, computing the number of bytes for each character.
If you know you don't need to deal with characters outside the BMP, this is reasonably simple:
    public string SubstringWithinUtf8Limit(string text, int byteLimit)
    {
        int byteCount = 0;
        char[] buffer = new char[1];
        for (int i = 0; i < text.Length; i++)
        {
            buffer[0] = text[i];
            byteCount += Encoding.UTF8.GetByteCount(buffer);
            if (byteCount > byteLimit)
            {
                // Couldn't add this character. Return everything before it.
                return text.Substring(0, i);
            }
        }
        return text;
    }
It becomes slightly trickier if you have to handle surrogate pairs :(
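A sketch of how the surrogate-pair case might look, assuming the same byte-limit semantics as above. The class name `Utf8Truncation` is only for illustration. The idea is to step two chars at a time whenever a high/low surrogate pair starts at the current index, so a pair is either kept whole or dropped whole.

```csharp
using System;
using System.Text;

public static class Utf8Truncation
{
    public static string SubstringWithinUtf8Limit(string text, int byteLimit)
    {
        char[] chars = text.ToCharArray();
        int byteCount = 0;
        int i = 0;

        while (i < chars.Length)
        {
            // A surrogate pair is two chars that encode a single code point.
            int charCount = char.IsSurrogatePair(text, i) ? 2 : 1;

            byteCount += Encoding.UTF8.GetByteCount(chars, i, charCount);
            if (byteCount > byteLimit)
            {
                // This code point does not fit. Return everything before it.
                return text.Substring(0, i);
            }

            i += charCount;
        }

        return text;
    }
}
```

This still cuts between a base character and a following combining mark; it only protects surrogate pairs.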
Upvotes: 7
Reputation: 26336
Another idea is to check whether the last character is the Unicode replacement character (U+FFFD), and to remove characters from the end until the result renders correctly.
    private static string UnicodeSubstring(string text, int length)
    {
        var bytes = Encoding.UTF8.GetBytes(text);
        var result = Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, length));
        // Check the length first so an empty result does not throw.
        while (result.Length > 0 && '\uFFFD' == result[result.Length - 1])
        {
            result = result.Substring(0, result.Length - 1);
        }
        return result;
    }
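A possible variant that avoids decoding a broken sequence at all, and so does not strip a U+FFFD that was genuinely part of the input: back up from the cut point while the byte there is a UTF-8 continuation byte (bit pattern 10xxxxxx). This is a sketch, not a tested Windows Phone implementation, and the names `Utf8Boundary` and `TruncateAtBoundary` are made up for illustration.

```csharp
using System;
using System.Text;

public static class Utf8Boundary
{
    public static string TruncateAtBoundary(string text, int maxBytes)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        if (bytes.Length <= maxBytes)
        {
            return text; // Already fits, nothing to cut.
        }

        int cut = Math.Max(0, maxBytes);

        // bytes[cut] is the first byte that would be dropped. If it is a
        // continuation byte, the cut falls inside a character, so move back
        // until it points at the first byte of that character.
        while (cut > 0 && (bytes[cut] & 0xC0) == 0x80)
        {
            cut--;
        }

        return Encoding.UTF8.GetString(bytes, 0, cut);
    }
}
```

Like the original, this works on code points, so it can still separate a base character from a combining mark that follows it.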
Upvotes: 0
Reputation: 100545
One option is to simply append "characters" (including surrogate pairs, if you need to support them) to the resulting string one at a time and check whether the result still converts to no more than the number of bytes you want.
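A rough sketch of that approach, here treating a "character" as a text element as reported by `StringInfo.GetTextElementEnumerator`, and assuming the limit is measured in UTF-8 bytes. The names `GreedyUtf8Builder` and `BuildWithinLimit` are hypothetical.

```csharp
using System.Globalization;
using System.Text;

public static class GreedyUtf8Builder
{
    public static string BuildWithinLimit(string text, int byteLimit)
    {
        var result = new StringBuilder();
        int usedBytes = 0;

        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            // One text element: a single char, a surrogate pair,
            // or a base character with its combining marks.
            string element = elements.GetTextElement();
            int elementBytes = Encoding.UTF8.GetByteCount(element);

            if (usedBytes + elementBytes > byteLimit)
            {
                break; // The next element would exceed the limit.
            }

            result.Append(element);
            usedBytes += elementBytes;
        }

        return result.ToString();
    }
}
```

Because whole text elements are appended, `BuildWithinLimit("e\u0301x", 2)` returns an empty string rather than a bare `e` without its accent.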
Upvotes: 1