Reputation: 847
I have a Delphi 7 application where I deal with ANSI strings and I need to count their number of characters (as opposed to the number of bytes). I always know the Charset (and thus the code page) associated with the string.
So, knowing the Charset (code page), I'm currently using MultiByteToWideChar
to get the number of characters. It's useful when the Charset is one of the Chinese, Korean, or Japanese charsets where most of the characters are 2 bytes in length and simply using the Length
function won't give me what I want.
However, it still counts composite characters as two characters, and I need them counted as one. Now, some composite characters have precomposed versions in Unicode, those would be counted correctly as one character since the MB_PRECOMPOSED
is used by default. But many characters simply don't exist as precomposed, for example characters in Hebrew, Arabic, Thai, etc, and those are counted as two.
So the question really is: How to count composite characters as single characters? I don't mind converting the ANSI strings to Wide strings to count the number of characters, I'm already doing it with MultiByteToWideChar
anyway.
Upvotes: 2
Views: 899
Reputation: 612784
You can count the Unicode code points like this:
function CodePointCount(P: PWideChar): Integer;
var
Count: Integer;
begin
Count := 0;
while Word(P^)<>0 do
begin
if (Word(P^)>=$D800) and (Word(P^)<=$DFFF) then
// part of surrogate pair
inc(Count)
else
inc(Count, 2);
inc(P);
end;
Result := Count div 2;
end;
This covers the issue that you did not mention. Namely that UTF-16 is a variable width encoding.
However, this will not tell you the number of glyphs represented by a UTF-16 string. That's because some code points represent combining characters. These combining characters combine with their neighbours to form a single equivalent character. So, multiple code-points, single glyph. More information can be found here: http://en.wikipedia.org/wiki/Unicode_equivalence
This is the harder issue. To solve it your code needs to fully understand the meaning of each Unicode code point. Is it a combining character? How does it combine? Really you need a dedicated Unicode library. For instance ICU.
The other suggestion I have for you is to give up using ANSI code pages. If you really care about internationalisation then you need to use Unicode.
Upvotes: 2