Reputation: 33395
Short version: here is a minimal failing example:
$> echo xóx > /tmp/input
$> hex /tmp/input
0x00000000: 78 c3 b3 78 0a
$> perl -e 'open F, "<", "/tmp/input" or die $!;
   while (<F>) {
       if ($_ =~ /x(\w)x/) {
           print "Match:$1\n";
       } else {
           print "No match\n";
       }
   }'
No match
Why does this fail, and how can I make the Perl script accept ó with \w?
Long version: I am scraping data from HTML using Perl (5.10). The end goal is to have strings represented exclusively by the ASCII printable set (0x20-0x7F). This will involve changing e.g. ó to &oacute;, and also mapping certain characters to approximations, e.g. various spaces end up as 0x20 and a certain kind of apostrophe (see later) should end up as plain old 0x27.
My quest began when "ó" =~ /\W/ returned true, which surprised me because perldoc perlretut tells me

    \w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_] but also digits and characters from non-roman scripts
I figure it's something to do with the character encoding. I don't know a great deal about this, but the source HTML contains

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

and a hexdump tells me that ó is encoded as b3c3 and not f3 as I had first expected.
In Perl, I tried to fix this with open F, "<:encoding(UTF-8)", $f but this gives me errors such as

utf8 "\xF3" does not map to Unicode

and strings like \xF3 appear in the output from read. It got weirder when I noticed that some characters are encoded out-of-order, which I don't understand at all. Here are two hexdumps (UNIX hexdump utility) for comparison:
Ralt => 61 52 74 6c
Réalt => c3 52 61 a9 74 6c
WTF?
Also, here's that damned apostrophe that I mentioned earlier.
Pats => 61 50 73 74
Pat’s => 61 50 e2 74 99 80
For part 2 I can confirm that my keyboard enters ó into the text editor using the same encoding as the files which are read in.
For part 3 it is not at all necessary to stay within Perl. I also only need mappings for common punctuation like apostrophes. Any exotic characters with no obvious ASCII equivalents are unexpected and should simply trigger failure.
Upvotes: 2
Views: 1797
Reputation: 385655
You take that string of bytes (the UTF-8 encoding of "xóx"), and you pass it to the regex engine, which expects a string of Unicode code points. The UTF-8 encoding of "xóx" is 78 C3 B3 78 0A, which is "xÃ³x" (followed by a newline) when treated as Unicode code points.
You actually want to pass 78 F3 78 0A to the regex engine, and that can be obtained through a process called "decoding".
For your one-liner in a UTF-8 environment, you could use -CSDA:
perl -CSDA -ne'
    if (/x(\w)x/) {
        print "Match:$1\n";
    } else {
        print "No match\n";
    }
' /tmp/input
For a script, you could use binmode, perhaps via use open:
use utf8;                             # Source code is UTF-8
use open ':std', ':encoding(UTF-8)';  # Set encoding for STD*
use open IO => ':encoding(UTF-8)';    # Default encoding for files

while (<>) {
    if (/x(\w)x/) {
        print "Match:$1\n";
    } else {
        print "No match\n";
    }
}
Always decode your inputs. Always encode your outputs.
As for your other question, you can use HTML::Entities to convert the text into HTML entities (once you've decoded it).
Note that it's kinda silly to encode characters other than «&», «<», «>», «"» and «'» (and not even all of those are needed) since you use
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
Upvotes: 0
Reputation: 39158
Your hexdumper sucks. Use a proper one.
$ echo -n Réalt | hex
0000 52 c3 a9 61 6c 74 R..alt
$ echo -n Pat’s | hex
0000 50 61 74 e2 80 99 73 Pat...s
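That also explains the "out-of-order" bytes in the question: plain hexdump's default format prints the input as 16-bit little-endian words, so adjacent byte pairs appear swapped. A quick comparison (using od -tx1 as the byte-at-a-time reference):

```shell
# Default hexdump groups bytes into 16-bit little-endian words,
# swapping each pair; od -tx1 prints the bytes in their true order.
printf 'Réalt' | hexdump       # words: c352 61a9 746c (pairs swapped)
printf 'Réalt' | od -An -tx1   # bytes: 52 c3 a9 61 6c 74
```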
Yes, the configuration is use utf8;, so that a literal ó in the Perl source code is treated as a character. s/ó/&oacute;/g works just fine, but you should use a module to deal with entities, as below.
3.
use utf8;
use HTML::Entities qw(encode_entities);
encode_entities 'Réalt'; # returns 'R&eacute;alt'
encode_entities 'Pat’s'; # returns 'Pat&rsquo;s'
Read http://p3rl.org/UNI to learn about the topic of encoding in Perl.
Upvotes: 3