Reputation: 33395

Character encoding messing up Perl regex

Short version: here is a minimal failing example:

$> echo xóx > /tmp/input
$> hex /tmp/input
0x00000000: 78 c3 b3 78 0a
$> perl -e 'open F, "<", "/tmp/input" or die $!;
       while(<F>) {
           if ($_=~/x(\w)x/) {
               print "Match:$1\n";
           }else{
               print "No match\n";
           }
       }'
No match

Why does this fail and how can I make the Perl script accept ó with \w?

Long version: I am scraping data from HTML using Perl (5.10). The end goal is to have strings represented exclusively by the ASCII printable set (0x20-0x7F). This will involve changing e.g. ó to &oacute; and also mapping certain characters to approximations, e.g. various spaces end up as 0x20 and a certain kind of apostrophe (see later) should end up as plain old 0x27.

My quest began when "ó"=~/\W/ returned true, which surprised me because perldoc perlretut tells me

\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts

I figure it's something to do with the character encoding. I don't know a great deal about this, but the source HTML contains

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

and a hexdump tells me that ó is encoded as b3c3 and not f3 as I had first expected.

In Perl, I tried to fix this with open F, "<:encoding(UTF-8)", $f but this gives me errors such as

utf8 "\xF3" does not map to Unicode

and strings like \xF3 appear in the output from read. It got weirder when I noticed that some characters are encoded out-of-order, which I don't understand at all. Here are two hexdumps (UNIX hexdump utility) for comparison:

Ralt => 61 52 74 6c

Réalt => c3 52 61 a9 74 6c

WTF?

Also, here's that damned apostrophe that I mentioned earlier.

Pats => 61 50 73 74

Pat’s => 61 50 e2 74 99 80

Here are my questions:

1. What's with the crazy out-of-order encoding?
2. Can I configure Perl to accept the above strings in regexes such as s/ó/&oacute;/g ?
3. What can I do to transform e.g. Pat’s into Pat's and basically get it all into ASCII, with HTML entities for the usual accented vowels?

For part 2 I can confirm that my keyboard enters ó into the text editor using the same encoding as the files which are read in.

For part 3 it is not at all necessary to stay within Perl. I also only need mappings for common punctuation like apostrophes. Any exotic characters with no obvious ASCII equivalents are unexpected and should simply trigger failure.

Upvotes: 2

Answers (2)

ikegami

Reputation: 385655

You take that string of bytes (the UTF-8 encoding of "xóx"), and you pass it to the regex engine which expects a string of Unicode code points. The UTF-8 encoding of "xóx" is 78 C3 B3 78 0A, which is "xÃ³x" when treated as Unicode code points.

You actually want to pass 78 F3 78 0A to the regex engine, and that can be obtained through a process called "decoding".
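To make the bytes-versus-characters difference concrete, here is a minimal sketch using the core Encode module. The helper name captured_code_point is made up for this illustration; the byte string is the one from the hex dump in the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The five bytes from the hex dump in the question: UTF-8 for "xóx\n".
my $bytes = "\x78\xC3\xB3\x78\x0A";

# Decoding turns the two bytes C3 B3 into the single code point U+00F3.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: returns the code point captured by /x(\w)x/,
# or undef if the regex does not match.
sub captured_code_point {
    my ($string) = @_;
    return $string =~ /x(\w)x/ ? ord($1) : undef;
}

for my $case ([bytes => $bytes], [chars => $chars]) {
    my ($label, $string) = @$case;
    my $cp = captured_code_point($string);
    printf "%s: length %d, %s\n", $label, length($string),
        defined $cp ? sprintf('match U+%04X', $cp) : 'no match';
}
```

The undecoded string has two characters between the x's, so /x(\w)x/ cannot match it no matter what \w accepts; the decoded string has exactly one.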

For your one-liner in a UTF-8 environment, you could use -CSDA:

perl -CSDA -ne'
    if (/x(\w)x/) {
        print "Match:$1\n";
    } else {
        print "No match\n";
    }
' /tmp/input

For a script, you could use binmode, perhaps via use open:

use utf8;                             # Source code is UTF-8
use open ':std', ':encoding(UTF-8)';  # Set encoding for STD*
use open IO => ':encoding(UTF-8)';    # Default encoding for files

while (<>) {
    if (/x(\w)x/) {
        print "Match:$1\n";
    } else {
        print "No match\n";
    }
}

Always decode your inputs. Always encode your outputs.
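As a rough illustration of that rule applied per handle rather than globally through use open, the sketch below decodes on read and encodes on write. It uses in-memory filehandles only so that it is self-contained; the function name upcase_utf8 is hypothetical, and with real files you would pass a filename instead of a scalar reference.

```perl
use strict;
use warnings;

# Hypothetical helper: takes UTF-8 bytes, upper-cases the text,
# and returns UTF-8 bytes again.
sub upcase_utf8 {
    my ($input_bytes) = @_;
    my $output_bytes = '';

    # Decode on the way in ...
    open my $in, '<:encoding(UTF-8)', \$input_bytes
        or die "Cannot open input: $!";
    # ... and encode on the way out.
    open my $out, '>:encoding(UTF-8)', \$output_bytes
        or die "Cannot open output: $!";

    while (my $line = <$in>) {
        # $line holds characters here, so uc sees one letter o-acute
        # rather than the two bytes C3 B3.
        print {$out} uc $line;
    }

    close $in;
    close $out or die "Cannot close output: $!";
    return $output_bytes;
}

my $result = upcase_utf8("x\xC3\xB3x\n");
print join(' ', map { sprintf '%02x', $_ } unpack 'C*', $result), "\n";
```

Everything between the two handles works on characters; bytes exist only at the edges of the program.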

As for your other question, you can use HTML::Entities to convert the text into HTML entities (once you've decoded it).

Note that it's kinda silly to encode characters other than «&», «<», «>», «"» and «'» (and not even all of those are needed) since you use

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
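If you really do need pure ASCII with a hard failure on anything unexpected (part 3 of the question), something along these lines might do. This is only a sketch: the table is hand-written, deliberately incomplete and entirely hypothetical, the name to_ascii_html is made up, and it avoids HTML::Entities only so that it runs on core Perl. It treats newline plus 0x20-0x7E as the allowed set and expects already decoded input.

```perl
use strict;
use warnings;

# Hypothetical, deliberately incomplete table; extend it for your data.
my %replacement_for = (
    "\x{2018}" => "'",          # left single quotation mark
    "\x{2019}" => "'",          # right single quotation mark (the Pat example)
    "\x{00A0}" => ' ',          # no-break space
    "\x{2009}" => ' ',          # thin space
    "\x{E1}"   => '&aacute;',
    "\x{E9}"   => '&eacute;',
    "\x{ED}"   => '&iacute;',
    "\x{F3}"   => '&oacute;',
    "\x{FA}"   => '&uacute;',
);

# Expects an already decoded string. Dies on any character outside
# newline and 0x20-0x7E that has no entry in the table above.
sub to_ascii_html {
    my ($text) = @_;
    $text =~ s{([^\x0A\x20-\x7E])}{
        $replacement_for{$1}
            // die sprintf("No ASCII mapping for U+%04X\n", ord $1)
    }ge;
    return $text;
}

print to_ascii_html("Pat\x{2019}s R\x{E9}alt"), "\n";
```

The defined-or operator (//) needs Perl 5.10 or later, which matches the version mentioned in the question.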

Upvotes: 0

daxim

Reputation: 39158

Your hexdumper sucks. Use a proper one.

$ echo -n Réalt | hex
0000  52 c3 a9 61 6c 74                                 R..alt
$ echo -n Pat’s | hex
0000  50 61 74 e2 80 99 73                              Pat...s
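The "out-of-order" bytes in the question are most likely a display artifact rather than anything in the file: by default, hexdump prints two-byte units in host byte order, so on a little-endian machine every pair of bytes appears swapped. The sketch below reproduces that effect; the function names are made up for this illustration.

```perl
use strict;
use warnings;

# UTF-8 bytes of "Réalt" in their real file order.
my $bytes = "\x52\xC3\xA9\x61\x6C\x74";

# One byte at a time, in file order.
sub dump_bytes {
    my ($data) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $data;
}

# Read the data as 16-bit little-endian words and print each word
# high byte first, which swaps every pair of bytes.
# Note: unpack 'v*' silently drops a trailing odd byte.
sub dump_le_words {
    my ($data) = @_;
    return join ' ',
        map { sprintf '%02x %02x', $_ >> 8, $_ & 0xFF } unpack 'v*', $data;
}

print dump_bytes($bytes),    "\n";
print dump_le_words($bytes), "\n";
```

The second line of output is the same byte sequence the question shows for Réalt. Asking hexdump for byte-wise output (for example with -C) avoids the confusion.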

Yes: put use utf8; in the script so that a literal ó in the Perl source code is treated as a character. s/ó/&oacute;/g then works just fine, but you should use a module to deal with entities, as below.

    use utf8;
    use HTML::Entities qw(encode_entities);

    encode_entities 'Réalt';    # returns 'R&eacute;alt'
    encode_entities 'Pat’s';    # returns 'Pat&rsquo;s'

Read http://p3rl.org/UNI to learn about the topic of encoding in Perl.

Upvotes: 3
