M Schenkel


How to decode UTF-8 Unicode characters with Indy

I have a TIdHTTPServer application that serves a simple HTML document containing special characters:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">


    <head>
        <meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
        <title>This is the title</title>
    </head>

    <body>
        <form method="post">
            <p>
                <input name="name" value="Все данные по веб-сайту" />
                <input type="submit" value="submit" />
            </p>
        </form>
    </body>
</html>

I serve this page and process the POST. My "Get" handler is below. The problem is that I am unable to decode the %hh-escaped data properly.

procedure TForm3.Get(AContext: TIdContext;
  ARequestInfo: TIdHTTPRequestInfo; AResponseInfo: TIdHTTPResponseInfo);
var
  mFileName: String;
  txtFile: TextFile;
begin
  if ARequestInfo.Params.values['name']<>'' then begin
    AssignFile( txtFile , ChangeFileExt(ParamStr(0),'.log') );
    Append( TxtFile );
    WriteLn(TxtFile,'Unparsed:'+ARequestInfo.UnparsedParams);
    WriteLn(TxtFile,'Parsed:'+ARequestInfo.Params.values['name']);
    MyDecodeAndSetParams(ARequestInfo);
    WriteLn(TxtFile,'Decoded:'+ARequestInfo.Params.values['name']);
    System.Close( TxtFile );
  end ;
  mFileName := ExtractFileDir(ParamStr(0))+'\inputform.txt';
  AResponseInfo.ContentStream := TFileStream.Create(mFileName, fmOpenRead);

end;

The MyDecodeAndSetParams function:

procedure MyDecodeAndSetParams(ARequestInfo: TIdHTTPRequestInfo);
var
  i, j : Integer;
  value,s: string;
  LEncoding: IIdTextEncoding;
begin
  if IsHeaderMediaType(ARequestInfo.ContentType, 'application/x-www-form-urlencoded') then
  begin
    value := ARequestInfo.FormParams;
//    LEncoding := CharsetToEncoding(ARequestInfo.CharSet);
    if ARequestInfo.CharSet <> '' then
      LEncoding := CharsetToEncoding(ARequestInfo.CharSet)
    else
      LEncoding := IndyTextEncoding_UTF8;
  end else
  begin
    value := ARequestInfo.QueryParams;
    LEncoding := IndyTextEncoding_UTF8;
  end;

  ARequestInfo.Params.BeginUpdate;
  try
    ARequestInfo.Params.Clear;
    i := 1;
    while i <= Length(value) do
    begin
      j := i;
      while (j <= Length(value)) and (value[j] <> '&') do
      begin
        Inc(j);
      end;
      s := StringReplace(Copy(value, i, j-i), '+', ' ', [rfReplaceAll]);
      ARequestInfo.Params.Add(TIdURI.URLDecode(s, LEncoding));
      i := j + 1;
    end;
  finally
    ARequestInfo.Params.EndUpdate;
  end;
end;

The output in my file is as follows:

Unparsed:name=%D0%92%D1%81%D0%B5+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D0%B5+%D0%BF%D0%BE+%D0%B2%D0%B5%D0%B1-%D1%81%D0%B0%D0%B9%D1%82%D1%83
Parsed:οсе даннϿе по веб-сайϿϿ
Decoded:οсе даннϿе по веб-сайϿϿ

I can take the Unparsed data and decode it using this decoder and it returns the string properly:

Все данные по веб-сайту

What do I need to do so that I can properly decode the params to what they were on the form?

Upvotes: 2


Answers (1)

Remy Lebeau


If ARequestInfo.CharSet is blank (because the client did not send a charset in the HTTP Content-Type header), CharsetToEncoding('') will return Indy's native 8-bit charset rather than UTF-8. That is why your data is not being decoded properly.

For application/x-www-form-urlencoded, a charset is not always sent in the HTTP headers, as the client may assume the server knows the charset to expect based on the charset it sends the HTML in. It is also possible that the client might send a charset in the posted form data instead, such as in a _charset_ field.
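
If you want to honor a _charset_ field, you would have to read it from the raw form data before decoding the other fields. Here is a minimal sketch, assuming the form contains a hidden _charset_ input that the browser fills in. ExtractFormCharset is a hypothetical helper, not part of Indy, and StrictDelimiter requires Delphi 2006 or later:

```pascal
program FormCharsetDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper (not part of Indy): returns the value of the
// _charset_ field in raw application/x-www-form-urlencoded data,
// or '' if the field is absent.
function ExtractFormCharset(const ARawParams: string): string;
var
  LPairs: TStringList;
begin
  LPairs := TStringList.Create;
  try
    LPairs.Delimiter := '&';
    LPairs.StrictDelimiter := True; // split on '&' only, not on spaces
    LPairs.DelimitedText := ARawParams;
    // Charset names are plain ASCII, so no percent-decoding is needed here.
    Result := LPairs.Values['_charset_'];
  finally
    LPairs.Free;
  end;
end;

begin
  Writeln(ExtractFormCharset('_charset_=UTF-8&name=%D0%92%D1%81%D0%B5'));
end.
```

The returned name could then be passed to CharsetToEncoding() in place of ARequestInfo.CharSet when the header does not carry a charset.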

Try changing this:

LEncoding := CharsetToEncoding(ARequestInfo.CharSet);

To this:

if ARequestInfo.CharSet <> '' then
  LEncoding := CharsetToEncoding(ARequestInfo.CharSet)
else
  LEncoding := IndyTextEncoding_UTF8;

This way, you default to UTF-8 unless the client sends an explicit charset.
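
To try the fallback outside of the server, something like the following console sketch may help. It assumes a Unicode version of Delphi with Indy 10 on the search path. EncodingForCharset and DecodeFormToken are hypothetical helper names introduced only for this example, and the input is the first word of the posted value:

```pascal
program DecodeFormValue;

{$APPTYPE CONSOLE}

uses
  SysUtils, IdGlobal, IdGlobalProtocols, IdURI;

// Hypothetical helper: use the charset the client sent, else assume UTF-8.
function EncodingForCharset(const ACharSet: string): IIdTextEncoding;
begin
  if ACharSet <> '' then
    Result := CharsetToEncoding(ACharSet)
  else
    Result := IndyTextEncoding_UTF8;
end;

// Hypothetical helper: decodes one form-urlencoded token.
// TIdURI.URLDecode() does not translate '+' to a space, so that is done first.
function DecodeFormToken(const AToken, ACharSet: string): string;
begin
  Result := TIdURI.URLDecode(
    StringReplace(AToken, '+', ' ', [rfReplaceAll]),
    EncodingForCharset(ACharSet));
end;

var
  Decoded: string;
begin
  // An empty charset simulates a client that sent no charset in the header.
  Decoded := DecodeFormToken('%D0%92%D1%81%D0%B5', '');
  // Compare against the expected code points instead of printing them,
  // because console output depends on the active code page.
  Writeln(Decoded = #$0412#$0441#$0435);
end.
```

Comparing code points rather than printing the string keeps the check independent of the console's code page, which is a separate source of '?' characters.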


Update: If you are using a pre-Unicode version of Delphi (2007 or earlier), Indy uses AnsiString instead of UnicodeString. In that case, TIdURI.URLDecode() first decodes the input to Unicode using the specified AByteEncoding parameter (defaulting to IndyTextEncoding_UTF8 if none is specified), and then converts the Unicode data to ANSI using the specified ADestEncoding parameter (defaulting to IndyTextEncoding_OSDefault if none is specified).

The Russian input you have shown decodes properly to Unicode when decoded as UTF-8, but can easily lose characters (turning them into '?') during the conversion to ANSI if your code is running on a machine that does not use a Russian charset at the OS layer, such as ISO-8859-5 or KOI8-R.

To ensure a correct conversion, you would have to specify the desired AnsiString encoding on those machines, e.g.:

var
  LEncoding, LAnsiEncoding: IIdTextEncoding;
...

LEncoding := IndyTextEncoding_UTF8;
LAnsiEncoding := CharsetToEncoding('ISO-8859-5'); // or 'KOI8-R', etc
...
ARequestInfo.Params.Add(TIdURI.URLDecode(s, LEncoding, LAnsiEncoding));

In Unicode versions of Delphi (2009 and later), Indy uses UnicodeString instead of AnsiString, so there is no ADestEncoding parameter present.
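
If the same source has to compile on both kinds of compiler, one option is to wrap the call in a conditional. This is only a sketch: DecodeUtf8Token is a hypothetical wrapper, and it assumes that the compiler's UNICODE symbol lines up with Indy's own choice of string type, which should hold for Delphi but may not for other compilers:

```pascal
program DecodeBothCompilers;

{$APPTYPE CONSOLE}

uses
  SysUtils, IdGlobal, IdGlobalProtocols, IdURI;

// Hypothetical wrapper: decodes a percent-encoded UTF-8 token.
// AAnsiCharSet is only used on pre-Unicode compilers, where the result
// must be converted to an AnsiString in some 8-bit charset.
function DecodeUtf8Token(const AToken, AAnsiCharSet: string): string;
begin
  {$IFDEF UNICODE}
  // Delphi 2009 and later: the result is a UnicodeString, no second encoding.
  Result := TIdURI.URLDecode(AToken, IndyTextEncoding_UTF8);
  {$ELSE}
  // Delphi 2007 and earlier: choose the target ANSI charset explicitly.
  Result := TIdURI.URLDecode(AToken, IndyTextEncoding_UTF8,
    CharsetToEncoding(AAnsiCharSet));
  {$ENDIF}
end;

var
  Decoded: string;
begin
  Decoded := DecodeUtf8Token('%D0%92%D1%81%D0%B5', 'ISO-8859-5');
  Writeln(Length(Decoded));
end.
```

On a Unicode compiler the second argument is simply ignored, so callers do not need their own conditionals.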

Upvotes: 6
