Reputation: 8115
Given this string
HELLO𝄞水
Legend (sizes in UTF-16, see http://en.wikipedia.org/wiki/UTF-16):
𝄞 is 4 bytes (two 16-bit code units)
水 is 2 bytes (one 16-bit code unit)
A PostgreSQL database (UTF-8) returns the correct length of 7:
select length('HELLO𝄞水');
I noticed that both .NET and Java return 8:
Console.WriteLine("HELLO𝄞水".Length);
System.out.println("HELLO𝄞水".length());
And SQL Server returns 8 too:
SELECT LEN(N'HELLO𝄞水');
.NET, Java and SQL Server return the correct string length when every character fits in a single UTF-16 code unit. They all return 6 for:
HELLO水
They return 7 when the string contains a character that takes two UTF-16 code units, which is incorrect:
HELLO𝄞
.NET, Java and SQL Server use UTF-16. It seems that their implementation of counting the length of a UTF-16 string is broken. Or is this mandated by UTF-16? UTF-16 is a variable-length encoding, just like its cousin UTF-8. So why can't UTF-16 (or is it the fault of .NET, Java, SQL Server and the like?) count the length of a string correctly, the way UTF-8 does?
Python returns a length of 12; I don't know why it returns 12, though. That might be another topic entirely, so I digress.
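To make the "variable-length" point concrete, here is a minimal Java sketch showing how the two characters from the legend are stored in UTF-16. The class name SurrogateDemo and the helper unitsFor are made up for illustration; the string literals use \u escapes for 𝄞 (U+1D11E) and 水 (U+6C34) so the file does not depend on the source encoding.

```java
public class SurrogateDemo {
    // Number of UTF-16 code units (Java chars) needed for one code point: 1 or 2.
    static int unitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E, stored as a surrogate pair
        String water = "\u6C34";      // U+6C34, fits in one code unit

        // length() counts code units, so the clef alone reports 2.
        System.out.println(clef.length());
        System.out.println(water.length());

        // The two halves of the pair are a high and a low surrogate.
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));

        System.out.println(unitsFor(0x1D11E));
        System.out.println(unitsFor(0x6C34));
    }
}
```

So the length being one higher than expected is not a miscount of code units; it is the code-unit count rather than the character count.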
len("HELLO𝄞水")
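One likely reading of the 12, assuming the Python here is Python 2 working on a UTF-8 encoded byte string: len is counting bytes, not characters. The sketch below reproduces that byte count in Java; the class name Utf8Length and the helper utf8Bytes are hypothetical names for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "HELLO" + U+1D11E (clef) + U+6C34 (water), written with escapes.
        String s = "HELLO\uD834\uDD1E\u6C34";

        // 5 ASCII bytes + 4 bytes for the clef + 3 bytes for the water character.
        System.out.println(utf8Bytes(s)); // 12
    }
}
```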
The question is: how do I get the correct character count in .NET, Java and SQL Server? It would be difficult to implement the next Twitter if a function returns an incorrect character count.
As an aside, I was not able to post this using Firefox; I posted this question from Google Chrome. Firefox could not display the variable-length Unicode characters.
Upvotes: 3
Views: 332
Reputation: 136
.NET: String.Length Property
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
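So we should use the StringInfo class to get the correct count of Unicode characters.
using System;
using System.Globalization;

String s = "HELLO𝄞水";
Console.WriteLine(s);
// Length counts UTF-16 code units (Char values).
Console.WriteLine("Count of char: {0:d}", s.Length);
// StringInfo counts text elements, so the surrogate pair counts once.
StringInfo info = new StringInfo(s);
Console.WriteLine("Count of Unicode characters: {0:d}", info.LengthInTextElements);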
The output:
HELLO𝄞水
Count of char: 8
Count of Unicode characters: 7
Upvotes: 0
Reputation: 44808
In Java:
String s = "HELLO𝄞水";
System.out.println(s.codePointCount(0, s.length())); // 7
System.out.println(s.length()); // 8
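For the Twitter-style limit mentioned in the question, counting is only half of it: cutting the string with substring on a raw char index could split a surrogate pair. A minimal sketch, assuming the limit is expressed in code points and is not negative; the class CodePointLimit and its methods are hypothetical names, not part of any library.

```java
public class CodePointLimit {
    // Number of Unicode code points in the string.
    static int count(String s) {
        return s.codePointCount(0, s.length());
    }

    // Keep at most maxCodePoints code points, never splitting a surrogate pair.
    // Assumes maxCodePoints >= 0.
    static String truncate(String s, int maxCodePoints) {
        if (count(s) <= maxCodePoints) {
            return s;
        }
        // Translate a code point offset into a char index.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "HELLO\uD834\uDD1E\u6C34"; // HELLO + clef + water
        System.out.println(count(s));                // 7
        System.out.println(truncate(s, 6).length()); // 7 chars: HELLO plus the intact pair
    }
}
```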
Upvotes: 3
Reputation: 100527
C# (and likely SQL Server and Java) returns the number of Char elements in the string, that is, UTF-16 code units rather than Unicode characters:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
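For Java, a rough counterpart to counting text elements is java.text.BreakIterator, which groups a base letter with its combining marks. This is only a sketch: the class name TextElements is made up, the iterator uses the default locale, and its boundaries are not guaranteed to match .NET's StringInfo for every sequence or across JDK versions.

```java
import java.text.BreakIterator;

public class TextElements {
    // Count user-perceived characters by walking character boundaries.
    static int count(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int n = 0;
        // Each call to next() advances to the following boundary until DONE.
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two chars, one text element.
        String accented = "e\u0301";
        System.out.println(accented.length());
        System.out.println(count(accented));
    }
}
```

This differs from codePointCount, which would report 2 for the accented example because the accent is a separate code point.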
Upvotes: 4