Reputation: 23
I am using Delphi 6.
I want to decode a Portuguese UTF-8 encoded string to a WideString, but I found that it isn't decoding correctly.
The original text is "ANÁLISE8". After using UTF8Decode(), the result is "ANALISE8". The accent on the "Á" disappears.
Here is the code:
var
  f : textfile;
  s : UTF8String;
  w, test : WideString;
begin
  // f is assumed to have been opened already (AssignFile/Reset not shown)
  while not eof(f) do
  begin
    readln(f,s);
    w := UTF8Decode(s);
  end;
end;
How can I decode the Portuguese UTF-8 string to WideString correctly?
Upvotes: 2
Views: 1366
Reputation: 595402
Note that the implementation of UTF8Decode() in Delphi 6 is incomplete. Specifically, it does not support 4-byte encoded sequences, which are needed to handle Unicode codepoints above U+FFFF. This means UTF8Decode() can only decode codepoints in the UCS-2 range, not the full Unicode repertoire, which makes it basically useless in Delphi 6 (and all the way up to Delphi 2007; it was finally fixed in Delphi 2009).
Try using the Win32 MultiByteToWideChar() function instead, e.g.:
uses
  ..., Windows;

function MyUTF8Decode(const s: UTF8String): WideString;
var
  Len: Integer;
begin
  // First call: ask how many WideChars the decoded text needs
  Len := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), nil, 0);
  SetLength(Result, Len);
  if Len > 0 then
    // Second call: decode into the buffer that was just allocated
    MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), PWideChar(Result), Len);
end;
var
  f : textfile;
  s : UTF8String;
  w, test : WideString;
begin
  // f is assumed to have been opened already (AssignFile/Reset not shown)
  while not eof(f) do
  begin
    readln(f,s);
    w := MyUTF8Decode(s);
  end;
end;
That being said, your ANÁLISE8 string falls within the UCS-2 range, so I tested UTF8Decode() in Delphi 6 and it decoded the UTF-8 encoded form of ANÁLISE8 just fine. I would conclude that either:
your UTF8String variable DOES NOT contain the UTF-8 encoded form of ANÁLISE8 to begin with (byte sequence 41 4E C3 81 4C 49 53 45 38), but instead contains the ASCII string ANALISE8 (byte sequence 41 4E 41 4C 49 53 45 38), which would decode as-is since ASCII is a subset of UTF-8. Double check your file, and the output of Readln().
your WideString contains ANÁLISE8 correctly as expected, but the way you are outputting/debugging it (which you did not show) is converting it to ANSI, losing the Á during the conversion.
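To tell those two cases apart, it may help to look at the raw values rather than at rendered text. The following is only a minimal sketch: DumpBytes and DumpWideChars are hypothetical helper names (not part of the RTL), and they assume nothing beyond IntToHex() from SysUtils. Their output is plain ASCII hex, so displaying it cannot lose the accent through an ANSI conversion.

```pascal
uses
  SysUtils;

// Hypothetical helper: hex-dump the raw bytes held in a UTF8String.
function DumpBytes(const s: UTF8String): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(s[i]), 2);
  end;
end;

// Hypothetical helper: hex-dump the UTF-16 code units held in a WideString.
function DumpWideChars(const w: WideString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(w) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(w[i]), 4);
  end;
end;
```

Calling DumpBytes(s) right after readln(f,s) shows whether the bytes C3 81 are present after 41 4E. If they are, calling DumpWideChars(w) after decoding should show 00C1 (the codepoint of Á) as the third value. If both dumps look right, the loss is happening in whatever code displays or stores the WideString afterwards.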
Upvotes: 3