John Ken

Reputation: 23

Can Delphi 6 convert UTF-8 Portuguese to WideString?

I am using Delphi 6.

I want to decode a Portuguese UTF-8 encoded string to a WideString, but I found that it isn't decoding correctly.

The original text is "ANÁLISE8". After using UTF8Decode(), the result is "ANALISE8". The symbol on top of the "A" disappears.

Here is the code:

var
  f : textfile;
  s : UTF8String;
  w, test : WideString;    
begin
  while not eof(f) do
  begin
    readln(f,s);
    w := UTF8Decode(s);

How can I decode the Portuguese UTF-8 string to WideString correctly?

Upvotes: 2

Views: 1366

Answers (1)

Remy Lebeau

Reputation: 595402

Note that the implementation of UTF8Decode() in Delphi 6 is incomplete. Specifically, it does not support 4-byte encoded sequences, which are needed to handle Unicode codepoints above U+FFFF. This means UTF8Decode() can only decode codepoints in the UCS-2 range, not the full Unicode repertoire, which makes it basically useless in Delphi 6 (and all the way up to Delphi 2007; it was finally fixed in Delphi 2009).
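To make the 4-byte limitation concrete, here is a minimal sketch that classifies a UTF-8 lead byte by the length of the sequence it starts. UTF8SequenceLength is a hypothetical helper name, not an RTL function, and it only inspects the lead byte (it does not reject overlong or otherwise invalid forms):

```pascal
// Hypothetical helper, for illustration only.
// Returns the number of bytes in the UTF-8 sequence that starts with Lead,
// or 0 if Lead is a continuation byte or cannot start a sequence.
function UTF8SequenceLength(Lead: Byte): Integer;
begin
  if Lead < $80 then
    Result := 1                      // U+0000..U+007F (plain ASCII)
  else if (Lead and $E0) = $C0 then
    Result := 2                      // U+0080..U+07FF
  else if (Lead and $F0) = $E0 then
    Result := 3                      // U+0800..U+FFFF
  else if (Lead and $F8) = $F0 then
    Result := 4                      // U+10000..U+10FFFF (outside UCS-2)
  else
    Result := 0;                     // continuation byte or invalid lead
end;
```

Only the last valid case is the one Delphi 6's UTF8Decode() cannot handle. The Á in the question is encoded with lead byte C3, which is a 2-byte sequence.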

Try using the Win32 MultiByteToWideChar() function instead, e.g.:

uses
  ..., Windows;

function MyUTF8Decode(const s: UTF8String): WideString;
var
  Len: Integer;
begin
  Len := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), nil, 0);
  SetLength(Result, Len);
  if Len > 0 then
    MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), PWideChar(Result), Len);
end;

var
  f : textfile;
  s : UTF8String;
  w, test : WideString;
begin
  while not eof(f) do
  begin
    readln(f,s);
    w := MyUTF8Decode(s);

That being said, your ANÁLISE8 string falls within the UCS-2 range, so I tested UTF8Decode() in Delphi 6 and it decoded the UTF-8 encoded form of ANÁLISE8 just fine. I would conclude that either:

  • your UTF8String variable DOES NOT contain the UTF-8 encoded form of ANÁLISE8 to begin with (byte sequence 41 4E C3 81 4C 49 53 45 38), but contains the ASCII string ANALISE8 instead (byte sequence 41 4E 41 4C 49 53 45 38), which would decode as-is since ASCII is a subset of UTF-8. Double-check your file and the output of Readln().

  • your WideString contains ANÁLISE8 correctly as expected, but the way you are outputting/debugging it (which you did not show) is converting it to ANSI, losing the Á during the conversion.
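One way to tell these two cases apart is to dump the raw values on both sides of the conversion instead of displaying the text. The following is a minimal sketch; BytesToHex and WideToHex are hypothetical helper names introduced here for illustration, not RTL functions:

```pascal
uses
  SysUtils;

// Hypothetical helper: hex dump of the raw bytes in s, separated by spaces.
function BytesToHex(const s: AnsiString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(s[i]), 2);
  end;
end;

// Hypothetical helper: hex dump of the UTF-16 code units in w.
function WideToHex(const w: WideString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(w) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(w[i]), 4);
  end;
end;
```

Call BytesToHex(s) right after Readln(f, s) and WideToHex(w) right after the decode. If the first dump shows 41 where C3 81 should be, the file itself does not contain the accented character. If the first dump shows C3 81 and the second shows 00C1 in the third position, the decode worked and the Á is being lost later, when the WideString is displayed or converted to ANSI.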

Upvotes: 3
