jedivader

Reputation: 847

How to get the number of characters (as opposed to the number of bytes) of a text in Delphi?

I have a Delphi 7 application where I deal with ANSI strings and I need to count their number of characters (as opposed to the number of bytes). I always know the Charset (and thus the code page) associated with the string.

So, knowing the Charset (code page), I'm currently using MultiByteToWideChar to get the number of characters. It's useful when the Charset is one of the Chinese, Korean, or Japanese charsets where most of the characters are 2 bytes in length and simply using the Length function won't give me what I want.

However, it still counts composite characters (a base character followed by a combining mark) as two characters, and I need them counted as one. Some composite characters have precomposed versions in Unicode, and those are counted correctly as one character since the MB_PRECOMPOSED flag is used by default. But many characters simply don't exist in precomposed form, for example in Hebrew, Arabic, Thai, etc., and those are counted as two.

So the question really is: How to count composite characters as single characters? I don't mind converting the ANSI strings to Wide strings to count the number of characters, I'm already doing it with MultiByteToWideChar anyway.

Upvotes: 2

Views: 899

Answers (1)

David Heffernan

Reputation: 612784

You can count the Unicode code points like this:

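function CodePointCount(P: PWideChar): Integer;
var
  Count: Integer; // counts half code points, so a surrogate pair adds 1 + 1 = 2
begin
  Count := 0;
  while Word(P^) <> 0 do
  begin
    if (Word(P^) >= $D800) and (Word(P^) <= $DFFF) then
      // one half of a surrogate pair: contributes half a code point
      Inc(Count)
    else
      // a code unit that is a complete code point on its own
      Inc(Count, 2);
    Inc(P);
  end;
  // assumes well-formed UTF-16, i.e. surrogates always come in pairs
  Result := Count div 2;
end;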

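This covers an issue that you did not mention, namely that UTF-16 is a variable-width encoding: code points outside the Basic Multilingual Plane are encoded as a surrogate pair of two 16-bit code units.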
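To feed your ANSI strings into CodePointCount you first need them as UTF-16. Here is a minimal sketch of that conversion, assuming you already have the Windows code page number for the string. The name AnsiToWide is just something I made up for illustration, and error handling is reduced to returning an empty string.

```pascal
uses
  Windows;

// Hypothetical helper: converts an ANSI string in the given code page
// to a UTF-16 WideString. Returns '' if the conversion fails.
function AnsiToWide(const S: AnsiString; CodePage: UINT): WideString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // First call: no output buffer, so the API returns the required
  // length in UTF-16 code units.
  Len := MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S), nil, 0);
  if Len <= 0 then
    Exit;
  SetLength(Result, Len);
  // Second call: perform the actual conversion into the buffer.
  MultiByteToWideChar(CodePage, 0, PAnsiChar(S), Length(S),
    PWideChar(Result), Len);
end;
```

With that in place, the count would be something like CodePointCount(PWideChar(AnsiToWide(S, CodePage))).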

However, this will not tell you the number of user-perceived characters (grapheme clusters) represented by a UTF-16 string. That's because some code points are combining characters, which combine with their neighbours to form a single visible character. So: multiple code points, a single character on screen. More information can be found here: http://en.wikipedia.org/wiki/Unicode_equivalence

This is the harder issue. To solve it, your code needs to fully understand the meaning of each Unicode code point. Is it a combining character? How does it combine? Really you need a dedicated Unicode library, for instance ICU.
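To illustrate the idea only, here is a rough sketch that skips a few well-known blocks of non-spacing marks while counting. The range table is a deliberately incomplete sample that I picked by hand, and both function names are hypothetical. It ignores most scripts, spacing marks, and the full grapheme cluster rules, so it is not a substitute for a proper Unicode library.

```pascal
// Hypothetical helper: True for a small, incomplete sample of
// non-spacing combining marks. Real code needs the full Unicode data.
function IsCombiningMarkSample(C: WideChar): Boolean;
begin
  case Word(C) of
    $0300..$036F,                      // Combining Diacritical Marks
    $0591..$05BD,                      // Hebrew accents and points
    $064B..$065F,                      // Arabic vowel marks and related signs
    $0E31, $0E34..$0E3A, $0E47..$0E4E: // Thai vowel signs and tone marks
      Result := True;
  else
    Result := False;
  end;
end;

// Counts code points that are not in the sample table above.
// Assumes well-formed UTF-16.
function BaseCharCount(P: PWideChar): Integer;
begin
  Result := 0;
  while Word(P^) <> 0 do
  begin
    // Skip low surrogates (the second half of a pair, so each pair is
    // counted once) and the sample combining marks.
    if not ((Word(P^) >= $DC00) and (Word(P^) <= $DFFF))
      and not IsCombiningMarkSample(P^) then
      Inc(Result);
    Inc(P);
  end;
end;
```

For example, 'e' followed by U+0301 (combining acute accent) would be counted as one character here, whereas CodePointCount reports two.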

The other suggestion I have for you is to give up using ANSI code pages. If you really care about internationalisation then you need to use Unicode.

Upvotes: 2
