spraff

Reputation: 33395

Character encoding messing up Perl regex

Short version: here is a minimal failing example:

$> echo xóx > /tmp/input
$> hex /tmp/input
0x00000000: 78 c3 b3 78 0a
$> perl -e 'open F, "<", "/tmp/input" or die $!;
       while(<F>) {
           if ($_=~/x(\w)x/) {
               print "Match:$1\n";
           }else{
               print "No match\n";
           }
       }'
No match

Why does this fail and how can I make the Perl script accept ó with \w?


Long version: I am scraping data from HTML using Perl (5.10). The end goal is to have strings represented exclusively by the ASCII printable set (0x20-0x7E). This will involve changing e.g. ó to &oacute; and also mapping certain characters to approximations, e.g. various spaces end up as 0x20 and a certain kind of apostrophe (see later) should end up as plain old 0x27.

My quest began when "ó"=~/\W/ returned true, which surprised me because perldoc perlretut tells me

\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts

I figure it's something to do with the character encoding. I don't know a great deal about this, but the source HTML contains

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

and a hexdump tells me that ó is encoded as b3c3 and not f3 as I had first expected.

In Perl, I tried to fix this with open F, "<:encoding(UTF-8)", $f but this gives me errors such as

utf8 "\xF3" does not map to Unicode

and strings like \xF3 appear in the output from read. It got weirder when I noticed that some characters are encoded out of order, which I don't understand at all. Here are two hexdumps (UNIX hexdump utility) for comparison:

Ralt => 61 52 74 6c

Réalt => c3 52 61 a9 74 6c

WTF?

Also, here's that damned apostrophe that I mentioned earlier.

Pats => 61 50 73 74

Pat’s => 61 50 e2 74 99 80

Here are my questions:

  1. What's with the crazy out-of-order encoding?
  2. Can I configure Perl to accept the above strings in regexes such as s/ó/&oacute;/g ?
  3. What can I do to transform e.g. Pat’s into Pat's and basically get it all into ASCII, with HTML entities for the usual accented vowels?

For part 2 I can confirm that my keyboard enters ó into the text editor using the same encoding as the files which are read in.

For part 3 it is not at all necessary to stay within Perl. I also only need mappings for common punctuation like apostrophes. Any exotic characters with no obvious ASCII equivalents are unexpected and should simply trigger failure.

Upvotes: 2

Views: 1797

Answers (2)

ikegami

Reputation: 385655

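You take that string of bytes (the UTF-8 encoding of "xóx") and pass it to the regex engine, which expects a string of Unicode code points. The UTF-8 encoding of "xóx" is 78 C3 B3 78 0A, which reads as "xÃ³x" followed by a newline when each byte is treated as a Unicode code point. The regex therefore sees two characters between the x's instead of one, and it cannot match.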

You actually want to pass 78 F3 78 0A to the regex engine, and that can be obtained through a process called "decoding".
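To make "decoding" concrete, here is a minimal sketch using the core Encode module. The helper name word_between_x is hypothetical, and the byte string stands in for what an undecoded read of /tmp/input would return (minus the trailing newline).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the single word character found between
# two "x" characters, or undef if the pattern does not match.
sub word_between_x {
    my ($string) = @_;
    return $string =~ /x(\w)x/ ? $1 : undef;
}

my $bytes = "x\xC3\xB3x";              # four bytes: UTF-8 for x, o-acute, x
my $chars = decode('UTF-8', $bytes);   # three characters: x, U+00F3, x

for my $case ([bytes => $bytes], [chars => $chars]) {
    my ($label, $string) = @$case;
    my $found = word_between_x($string);
    printf "%s: length %d, %s\n", $label, length $string,
        defined $found ? sprintf('match U+%04X', ord $found) : 'no match';
}
```

This should print "bytes: length 4, no match" and then "chars: length 3, match U+00F3". The match is reported as a code point rather than printed directly, because printing the character itself would also require an encoding layer on STDOUT.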

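For your one-liner in a UTF-8 environment, you could use -CSDA (S decodes and encodes STDIN, STDOUT and STDERR, D makes UTF-8 the default layer for files the program opens, and A decodes @ARGV):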

perl -CSDA -ne'
    if (/x(\w)x/) {
        print "Match:$1\n";
    } else {
        print "No match\n";
    }
' /tmp/input

For a script, you could use binmode, perhaps via use open:

use utf8;                             # Source code is UTF-8
use open ':std', ':encoding(UTF-8)';  # Set encoding for STD*
use open IO => ':encoding(UTF-8)';    # Default encoding for files

while (<>) {
    if (/x(\w)x/) {
        print "Match:$1\n";
    } else {
        print "No match\n";
    }
}

Always decode your inputs. Always encode your outputs.
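The error quoted in the question (utf8 "\xF3" does not map to Unicode) suggests that at least some of the input is not valid UTF-8; a lone F3 byte is ó in Latin-1. If that is the case, one option is to decode strictly and fall back. The following is a minimal sketch under that assumption, not a general charset detector: the function name is hypothetical, and ISO-8859-1 as the fallback is a guess about the data.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode as strict UTF-8, and if the bytes are not
# valid UTF-8, assume they are ISO-8859-1 (an assumption about the input).
sub decode_utf8_or_latin1 {
    my ($octets) = @_;
    my $text = eval {
        # FB_CROAK dies on malformed input; LEAVE_SRC keeps $octets intact.
        Encode::decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text if defined $text;
    return Encode::decode('ISO-8859-1', $octets);
}

# Both byte sequences decode to the same three characters: x, U+00F3, x.
for my $octets ("x\xC3\xB3x", "x\xF3x") {
    my $text = decode_utf8_or_latin1($octets);
    print join(' ', map { sprintf 'U+%04X', ord } split //, $text), "\n";
}
```

To use this, read the file without an encoding layer (or with :raw) and pass each line through the function before applying any regex.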


As for your other question, you can use HTML::Entities to convert the text into HTML entities (once you've decoded it).

Note that it's kinda silly to encode characters other than «&», «<», «>», «"» and «'» (and not even all of those are needed) since you use

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

Upvotes: 0

daxim

Reputation: 39158

  1. Your hexdumper sucks. Use a proper one.

    $ echo -n Réalt | hex
    0000  52 c3 a9 61 6c 74                                 R..alt
    $ echo -n Pat’s | hex
    0000  50 61 74 e2 80 99 73                              Pat...s
    
  2. Yes, the configuration is use utf8;, so that a literal ó in the Perl source code is treated as a character. s/ó/&oacute;/g works just fine, but you should use a module to deal with entities as below.

  3. Use HTML::Entities:

    use utf8;
    use HTML::Entities qw(encode_entities);

    encode_entities 'Réalt';    # returns 'R&eacute;alt'
    encode_entities 'Pat’s';    # returns 'Pat&rsquo;s'
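Note that the module call above yields &rsquo; rather than the plain 0x27 the question asks for, and it never fails on unknown characters. If the stricter behaviour is wanted, a small explicit table is one way to get it. This is a minimal sketch using only core Perl; the function name and both tables are hypothetical and deliberately short, and the input is assumed to be already decoded.

```perl
use strict;
use warnings;

# Hypothetical tables for illustration; extend them as needed.
my %ascii_for = (
    "\x{2018}" => "'",    # left single quotation mark
    "\x{2019}" => "'",    # right single quotation mark
    "\x{201C}" => '"',    # left double quotation mark
    "\x{201D}" => '"',    # right double quotation mark
    "\x{00A0}" => ' ',    # no-break space
    "\x{2009}" => ' ',    # thin space
);
my %entity_for = (
    "\x{E1}" => '&aacute;', "\x{E9}" => '&eacute;', "\x{ED}" => '&iacute;',
    "\x{F3}" => '&oacute;', "\x{FA}" => '&uacute;',
    "\x{C1}" => '&Aacute;', "\x{C9}" => '&Eacute;', "\x{CD}" => '&Iacute;',
    "\x{D3}" => '&Oacute;', "\x{DA}" => '&Uacute;',
);

# Replace every character outside printable ASCII (newline is kept) with
# its mapping, and die on anything that has no entry in either table.
sub to_ascii_html {
    my ($text) = @_;
    $text =~ s{([^\x20-\x7E\n])}{
        my $char = $1;
        exists $ascii_for{$char}  ? $ascii_for{$char}
      : exists $entity_for{$char} ? $entity_for{$char}
      : die sprintf("No ASCII mapping for U+%04X\n", ord $char);
    }ge;
    return $text;
}

print to_ascii_html("R\x{E9}alt"), "\n";      # R&eacute;alt
print to_ascii_html("Pat\x{2019}s"), "\n";    # Pat's
```

Dying on unmapped characters matches the "should simply trigger failure" requirement. Tabs and carriage returns also count as unmapped here, so add them to the tables if the input contains them.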

Read http://p3rl.org/UNI to learn about the topic of encoding in Perl.
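On point 1, the "out-of-order" bytes are a display artifact, not an encoding: with no options, the UNIX hexdump utility prints 16-bit words in host byte order, so on a little-endian machine each pair of bytes appears swapped (hexdump -C prints bytes in file order). A minimal sketch that reproduces both views; the helper names are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: bytes in file order, one byte per group.
sub bytes_in_order {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

# Hypothetical helper: the same bytes read as 16-bit little-endian words,
# which is what produces the swapped pairs shown in the question.
sub little_endian_words {
    my ($bytes) = @_;
    $bytes .= "\0" if length($bytes) % 2;    # pad odd-length input
    return join ' ', map { sprintf '%04x', $_ } unpack 'v*', $bytes;
}

my $realt = "R\xC3\xA9alt";    # UTF-8 bytes of "Realt" with e-acute

print bytes_in_order($realt), "\n";         # 52 c3 a9 61 6c 74
print little_endian_words($realt), "\n";    # c352 61a9 746c
```

The second line corresponds to the c3 52 61 a9 74 6c shown in the question, so the file itself holds an ordinary UTF-8 é (c3 a9).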

Upvotes: 3
