Reputation: 38651
I have seen "How do I read UTF-8 with diamond operator (<>)?", but unfortunately nothing in there helps me. Consider this code, test.pl:
use 5.010;
use warnings;
use strict;
use utf8; # tell Perl that your script is written in UTF-8?
binmode(STDOUT, ":raw");
binmode(STDIN, ":raw");
#~ use open IO => ':raw'; # nope
use open qw(:std :utf8); # nope?
use Data::Dumper;
my $reValidLine = qr/^[├│└]/;
my $line2 = "│ ├── [dr";
my @matches = $line2 =~ $reValidLine;
print Dumper(\@matches);
while(<>) {
    binmode ARGV, ':utf8';
    my $line = $_;
    my @imatches = $line =~ $reValidLine;
    print Dumper(\@imatches);
}
If I call this from the bash command line, I get this:
$ echo "│ ├── [dr" | perl test.pl
$VAR1 = [
1
];
$VAR1 = [];
Note that I'm piping (via echo) into perl's stdin the exact same string that is stored in $line2 in the Perl code. The very same regex matches $line2, but does not match the same string when it comes from stdin?
Just for confirmation, here is what hexdump and utfinfo.pl report in the very same shell:
$ echo "│ ├── [dr" | hexdump -C
00000000 e2 94 82 c2 a0 c2 a0 20 e2 94 9c e2 94 80 e2 94 |....... ........|
00000010 80 20 5b 64 72 0a |. [dr.|
00000016
$ echo "│ ├── [dr" | perl utfinfo.pl
Got 11 uchars
Char: '│' u: 9474 [0x2502] b: 226,148,130 [0xE2,0x94,0x82] n: BOX DRAWINGS LIGHT VERTICAL [Box Drawing]
Char: ' ' u: 160 [0x00A0] b: 194,160 [0xC2,0xA0] n: NO-BREAK SPACE [Latin-1 Supplement]
Char: ' ' u: 160 [0x00A0] b: 194,160 [0xC2,0xA0] n: NO-BREAK SPACE [Latin-1 Supplement]
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: '├' u: 9500 [0x251C] b: 226,148,156 [0xE2,0x94,0x9C] n: BOX DRAWINGS LIGHT VERTICAL AND RIGHT [Box Drawing]
Char: '─' u: 9472 [0x2500] b: 226,148,128 [0xE2,0x94,0x80] n: BOX DRAWINGS LIGHT HORIZONTAL [Box Drawing]
Char: '─' u: 9472 [0x2500] b: 226,148,128 [0xE2,0x94,0x80] n: BOX DRAWINGS LIGHT HORIZONTAL [Box Drawing]
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: '[' u: 91 [0x005B] b: 91 [0x5B] n: LEFT SQUARE BRACKET [Basic Latin]
Char: 'd' u: 100 [0x0064] b: 100 [0x64] n: LATIN SMALL LETTER D [Basic Latin]
Char: 'r' u: 114 [0x0072] b: 114 [0x72] n: LATIN SMALL LETTER R [Basic Latin]
So both of them confirm that the bytes are the correct UTF-8 encoding of these characters.
Then why doesn't the Perl regex match the string when it's piped from stdin, and how do I get it to match?
Upvotes: 1
Views: 473
Reputation: 13664
The problem with doing this inside your while loop:
binmode ARGV, ':utf8';
is that it's too late. By the time that binmode has executed, you've already read the first line from the filehandle. (And this particular filehandle only has one line!)
Try adding a newline character at the start of the input and you'll see that the binmode does actually work for the subsequent lines.
echo -e "\n│ ├── [dr" | perl test.pl
Lifting binmode ARGV, ':utf8' out of the while loop won't work, though, because at that point the special ARGV filehandle won't have been opened yet.
Personally, I'd solve this by reading the handle in raw mode and using the Encode module to decode the UTF-8.
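A minimal sketch of that approach, assuming the input is well-formed UTF-8. The helper name match_flags is made up for illustration, and the demo reads from an in-memory handle holding the bytes from the hexdump in the question rather than from stdin, so it can be run on its own:

```perl
use strict;
use warnings;
use utf8;                 # the regex literal below is written in UTF-8
use Encode qw(decode);

my $reValidLine = qr/^[├│└]/;

# Hypothetical helper: read raw bytes from $fh, decode each line as UTF-8,
# and return one 0/1 flag per line saying whether the regex matched.
sub match_flags {
    my ($fh) = @_;
    binmode($fh, ':raw');                    # no decoding layer on the handle
    my @flags;
    while (my $bytes = <$fh>) {
        my $line = decode('UTF-8', $bytes);  # bytes -> characters
        push @flags, ($line =~ $reValidLine ? 1 : 0);
    }
    return @flags;
}

# Demo: the same bytes that hexdump shows in the question
my $input = "\xE2\x94\x82\xC2\xA0\xC2\xA0 \xE2\x94\x9C\xE2\x94\x80\xE2\x94\x80 [dr\n";
open(my $fh, '<', \$input) or die "cannot open in-memory handle: $!";
print join(',', match_flags($fh)), "\n";
```

For the script in the question, the same idea would be to keep binmode(STDIN, ":raw"), drop the binmode ARGV line, and call decode on each line before applying the regex.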
Upvotes: 1
Reputation: 38651
Well, got it to work with this change:
binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");
... now I have:
$ echo "│ ├── [dr" | perl test.pl
$VAR1 = [
1
];
$VAR1 = [
1
];
... which is what I expected; unfortunately, I can't offer much insight into the reasons behind this :)
Still, hope it might help someone...
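A likely explanation, offered as a sketch rather than a definitive diagnosis: use open qw(:std :utf8) takes effect at compile time, while binmode(STDIN, ":raw") runs later at run time and removes the decoding layer again. With no file arguments, <> reads from STDIN, so the loop saw undecoded bytes. The snippet below compares the two layers on an in-memory handle holding the bytes from the hexdump. The helper name match_flags_with_layer is made up for illustration, and it uses the stricter ':encoding(UTF-8)' layer instead of ':utf8':

```perl
use strict;
use warnings;
use utf8;   # the regex literal below is written in UTF-8

my $reValidLine = qr/^[├│└]/;

# Hypothetical helper: apply $layer to $fh before the first read, then
# return one 0/1 flag per line saying whether the regex matched.
sub match_flags_with_layer {
    my ($fh, $layer) = @_;
    binmode($fh, $layer) or die "binmode failed: $!";
    return map { $_ =~ $reValidLine ? 1 : 0 } <$fh>;
}

# The same bytes as in the hexdump, read once per layer
my $bytes = "\xE2\x94\x82\xC2\xA0\xC2\xA0 \xE2\x94\x9C\xE2\x94\x80\xE2\x94\x80 [dr\n";
for my $layer (':raw', ':encoding(UTF-8)') {
    open(my $fh, '<', \$bytes) or die "cannot open in-memory handle: $!";
    printf "%-18s %s\n", $layer, join(',', match_flags_with_layer($fh, $layer));
}
```

The line should be reported as not matching under ':raw' and as matching under ':encoding(UTF-8)', which mirrors the before and after of the change above.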
Upvotes: 1