user3809670

Reputation: 15

Perl and HTML: UTF8 does not work in forms

I tried to convert my Perl / HTML files to UTF-8. Unfortunately, I have a problem with forms. I created a small test script that demonstrates the problem. All it does is reload itself so that the text entered is shown again. It works fine with ASCII characters, but as soon as I enter German umlauts (ÄÖÜ), the characters get distorted. It cannot handle Russian characters (ЭЯЮ) either. Here is the script:

#!/usr/bin/perl

use utf8;
use Encode;
use open ':std', ':encoding(UTF-8)';

# Save query string in hash:
$querystring = $ENV{ 'QUERY_STRING' };
read(STDIN, $poststring, $ENV{CONTENT_LENGTH});
if (($querystring ne "") && ($poststring ne "")) { $querystring .= "&$poststring"; } 
    else { $querystring .= $poststring; }

$querystring =~ s/&/=/gi;
%query = split( /=/, $querystring );
foreach $key ( keys( %query ) ) {
    $query{$key} =~ tr/+/ /;
    $query{$key} =~ s/%([\da-f][\da-f])/chr( hex($1) )/egi;
    $uquer{$key} = decode_utf8( $query{$key} );
}

print "Content-Type: text/html; charset=\"UTF-8\"\n\n";
print <<END;
    <HTML>
        <HEAD>
            <META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
        </HEAD>
        <BODY>
            <FORM NAME="frmeing" METHOD="POST" ACTION="test0.cgi">
                <INPUT NAME="df_kurs" TYPE="TEXT" VALUE="$uquer{'df_kurs'}">
                <INPUT TYPE="SUBMIT">
            </FORM>
        </BODY>
    </HTML>
END

You can test this script yourself. It is online at this address: http://project-website.org/test/test0.cgi

Does anybody know what the problem could be? Thank you in advance for your help!

Upvotes: 1

Views: 1349

Answers (1)

ikegami

Reputation: 385764

It's due to a bug in your version of decode_utf8.

$ perl -Mutf8 -MEncode -E'
   $u = $d = encode_utf8("é");
   utf8::upgrade($u);   # Changes how the string is stored internally
   say $u eq $d ?1:0;
   say decode_utf8($d) eq decode_utf8($u) ?1:0;
'
1
0

As you can see, $u and $d are equal, but your version of decode_utf8 decodes them differently. Specifically, it returns $u unchanged.

This has been fixed in newer versions of Encode. (2.53, I think.)
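The only difference between $u and $d is how Perl stores them internally, which utf8::is_utf8 can reveal. Here is a minimal sketch of that. The downgrade step is an assumed workaround for an older Encode, not something I have checked against the affected version.

```perl
use strict;
use warnings;
use Encode qw( decode_utf8 );

my $d = "\xC3\xA9";      # UTF-8 bytes of "é", stored one byte per character
my $u = $d;
utf8::upgrade($u);       # Same two characters, different internal storage

my $same_value   = $u eq $d ? 1 : 0;
my $flag_differs = ( utf8::is_utf8($u) && !utf8::is_utf8($d) ) ? 1 : 0;

# Assumed workaround: force the byte-oriented storage before decoding.
# utf8::downgrade() dies if the string holds a character above 0xFF.
utf8::downgrade($u);
my $decoded = decode_utf8($u);

print "$same_value $flag_differs ", length($decoded), "\n";
```

With a current Encode the downgrade is unnecessary, since both storage formats decode to the same one-character string.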

The easier way to address the problem is to fix the bug in your own script. With use open ':std', ':encoding(UTF-8)', you tell Perl to decode STDIN from UTF-8 as it is read. Your script then unescapes the URL-encoding and decodes from UTF-8 a second time.
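The intended order is: read the form data as bytes, undo the URL-encoding, then decode from UTF-8 exactly once. Here is a minimal sketch of that order in isolation. The helper name unescape_and_decode is made up for illustration, and the input is a hard-coded sample rather than real STDIN.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: URL-encoded byte string in, decoded text string out.
sub unescape_and_decode {
    my ($raw) = @_;
    $raw =~ tr/+/ /;                               # "+" stands for a space in form data
    $raw =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # Percent-escapes become raw bytes
    return decode('UTF-8', $raw);                  # Bytes become characters, once
}

# "ÄÖÜ" as a browser would submit it from a UTF-8 page.
my $text = unescape_and_decode('%C3%84%C3%96%C3%9C');

printf "%d characters: %s\n",
    length($text),
    join ' ', map { sprintf 'U+%04X', ord } split //, $text;
```

The six escaped bytes come out as three characters. No decoding layer should be applied to the handle the raw string was read from.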

Fix:

#!/usr/bin/perl

use utf8;                      # Source code is encoded using UTF-8.
use open ':encoding(UTF-8)';   # Set default encoding for file handles.
BEGIN { binmode(STDOUT, ':encoding(UTF-8)'); }  # HTML
BEGIN { binmode(STDERR, ':encoding(UTF-8)'); }  # Error log

use Encode;

# Save query string in hash:
$querystring = $ENV{ 'QUERY_STRING' };
read(STDIN, my $poststring, $ENV{CONTENT_LENGTH});
if (($querystring ne "") && ($poststring ne "")) { $querystring .= "&$poststring"; } 
    else { $querystring .= $poststring; }

$querystring =~ s/&/=/gi;
%query = split( /=/, $querystring );
foreach $key ( keys( %query ) ) {
    $query{$key} =~ tr/+/ /;
    $query{$key} =~ s/%([\da-f][\da-f])/chr( hex($1) )/egi;
    $uquer{$key} = decode_utf8( $query{$key} );
}

print "Content-Type: text/html; charset=\"UTF-8\"\n\n";
print <<END;
    <HTML>
        <HEAD>
            <META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
        </HEAD>
        <BODY>
            <FORM NAME="frmeing" METHOD="POST">
                <INPUT NAME="df_kurs" TYPE="TEXT" VALUE="$uquer{'df_kurs'}">
                <INPUT TYPE="SUBMIT">
            </FORM>
        </BODY>
    </HTML>
END

But you really should use CGI.pm.

#!/usr/bin/perl

use strict;    # Always!
use warnings;  # Always!

use utf8;                      # Source code is encoded using UTF-8.
use open ':encoding(UTF-8)';   # Set default encoding for file handles.
BEGIN { binmode(STDOUT, ':encoding(UTF-8)'); }  # HTML
BEGIN { binmode(STDERR, ':encoding(UTF-8)'); }  # Error log

use CGI qw( -utf8 );
use Encode;

my $cgi = CGI->new();
my %uquer = $cgi->Vars();

print $cgi->header('text/html; charset=UTF-8');
print <<END;
    <HTML>
        <HEAD>
            <META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
        </HEAD>
        <BODY>
            <FORM NAME="frmeing" METHOD="POST">
                <INPUT NAME="df_kurs" TYPE="TEXT" VALUE="$uquer{'df_kurs'}">
                <INPUT TYPE="SUBMIT">
            </FORM>
        </BODY>
    </HTML>
END

Upvotes: 6
