Reputation: 38651

UTF-8 with diamond operator (<>), part II (stdin and regex)

I have seen "How do I read UTF-8 with diamond operator (<>)?", but unfortunately nothing in there helps me. Consider this code, test.pl:

use 5.010;
use warnings;
use strict;
use utf8; # tell Perl that your script is written in UTF-8?
binmode(STDOUT, ":raw");
binmode(STDIN, ":raw");
#~ use open IO  => ':raw'; # nope
use open qw(:std :utf8); # nope?
use Data::Dumper;

my $reValidLine = qr/^[├│└]/;

my $line2 = "│   ├── [dr";
my @matches = $line2 =~ $reValidLine;
print Dumper(\@matches);

while(<>) {
  binmode ARGV, ':utf8';
  my $line = $_;
  my @imatches = $line =~ $reValidLine;
  print Dumper(\@imatches);
}

If I call this from the bash command line, I get this:

$ echo "│   ├── [dr" | perl test.pl
$VAR1 = [
          1
        ];
$VAR1 = [];

Note that I'm piping (via echo) into Perl's stdin the exact same string that is assigned to $line2 in the Perl code. The very same regex matches $line2, but it does not match the same string when it comes from stdin.

Just for confirmation, here is what hexdump and utfinfo.pl report in the very same shell:

$ echo "│   ├── [dr" | hexdump -C
00000000  e2 94 82 c2 a0 c2 a0 20  e2 94 9c e2 94 80 e2 94  |....... ........|
00000010  80 20 5b 64 72 0a                                 |. [dr.|
00000016

$ echo "│   ├── [dr" | perl utfinfo.pl 
Got 11 uchars
Char: '│' u: 9474 [0x2502] b: 226,148,130 [0xE2,0x94,0x82] n: BOX DRAWINGS LIGHT VERTICAL [Box Drawing]
Char: ' ' u: 160 [0x00A0] b: 194,160 [0xC2,0xA0] n: NO-BREAK SPACE [Latin-1 Supplement]
Char: ' ' u: 160 [0x00A0] b: 194,160 [0xC2,0xA0] n: NO-BREAK SPACE [Latin-1 Supplement]
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: '├' u: 9500 [0x251C] b: 226,148,156 [0xE2,0x94,0x9C] n: BOX DRAWINGS LIGHT VERTICAL AND RIGHT [Box Drawing]
Char: '─' u: 9472 [0x2500] b: 226,148,128 [0xE2,0x94,0x80] n: BOX DRAWINGS LIGHT HORIZONTAL [Box Drawing]
Char: '─' u: 9472 [0x2500] b: 226,148,128 [0xE2,0x94,0x80] n: BOX DRAWINGS LIGHT HORIZONTAL [Box Drawing]
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: '[' u: 91 [0x005B] b: 91 [0x5B] n: LEFT SQUARE BRACKET [Basic Latin]
Char: 'd' u: 100 [0x0064] b: 100 [0x64] n: LATIN SMALL LETTER D [Basic Latin]
Char: 'r' u: 114 [0x0072] b: 114 [0x72] n: LATIN SMALL LETTER R [Basic Latin]

So both of them report the same bytes, and those bytes are the correct UTF-8 encoding of the characters.

Then why doesn't the Perl regex match the string when it's piped from stdin, and how do I get it to match?

Upvotes: 1

Answers (2)

tobyink

Reputation: 13664

The problem with doing this inside your while loop:

binmode ARGV, ':utf8';

is that it's too late. By the time that binmode has executed, you've already read the first line from the filehandle. (And this particular filehandle only has one line!)

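Try adding a newline character in front of the input and you'll see that the binmode does actually work for the subsequent lines. (printf is used here because bash's builtin echo does not expand \n unless you pass -e.)

printf '\n│   ├── [dr\n' | perl test.pl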
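The same timing effect can be reproduced without a shell. This is a minimal sketch, assuming an in-memory filehandle behaves like the piped input for this purpose; the two sample lines and the @results array are made up for illustration. The first line is read as bytes before binmode runs, the second is read after the layer has changed.

```perl
use strict;
use warnings;
use utf8;                       # the regex literal is UTF-8 source text
use Encode qw(encode);

my $reValidLine = qr/^[├│└]/;

# Two UTF-8 encoded lines, held as bytes in an in-memory "file" (illustrative data).
my $bytes = encode('UTF-8', "│ first\n│ second\n");
open my $fh, '<:raw', \$bytes or die "cannot open in-memory handle: $!";

my @results;
while (my $line = <$fh>) {
    # Mirrors test.pl: the layer changes only after this line was already read.
    binmode $fh, ':utf8';
    push @results, ($line =~ $reValidLine ? 'match' : 'no match');
}

print "line $_: $results[$_ - 1]\n" for 1 .. @results;
```

Only the line read after the binmode call arrives as decoded characters, which is why a one-line input never gets the chance to match.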

Lifting binmode ARGV, ':utf8' out of the while loop won't work, though, because at that point the special ARGV filehandle won't have been opened yet.

Personally I'd solve this by reading the handle in raw mode and using the Encode module to decode the UTF-8.
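A rough sketch of that approach, not a drop-in replacement for test.pl: the helper name line_is_valid and the in-memory sample input are assumptions made for illustration. The handle stays in raw mode and each line is decoded by hand with Encode::decode before the regex sees it.

```perl
use strict;
use warnings;
use utf8;                            # the regex literal is UTF-8 source text
use Encode qw(decode encode);

my $reValidLine = qr/^[├│└]/;

# Hypothetical helper: decode one raw line (bytes) into characters, then test it.
sub line_is_valid {
    my ($raw) = @_;
    my $line = decode('UTF-8', $raw);    # bytes -> Perl characters
    return $line =~ $reValidLine ? 1 : 0;
}

# Stand-in for piped input: UTF-8 bytes read through a raw in-memory handle.
my $bytes = encode('UTF-8', "│   ├── [dr\n");
open my $fh, '<:raw', \$bytes or die "cannot open in-memory handle: $!";

while (my $raw = <$fh>) {
    print line_is_valid($raw) ? "match\n" : "no match\n";
}
```

In the real script the loop would read from <> instead of $fh, with no :utf8 layer on the input, so no line can slip through undecoded regardless of when the handle was opened.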

Upvotes: 1

sdaau

Reputation: 38651

Well, got it to work with this change:

binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");

... now I have:

$ echo "│   ├── [dr" | perl test.pl
$VAR1 = [
          1
        ];
$VAR1 = [
          1
        ];

... which is what I expected; unfortunately I cannot provide much understanding about reasons behind this :) Still, hope it might help someone...
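A likely explanation, offered as an assumption rather than something confirmed above: with an empty @ARGV, <> reads standard input, so the layer set on STDIN decides whether lines arrive as raw bytes or as decoded characters. The sketch below reads the same bytes through ':raw' and ':utf8' using an in-memory handle; the probe helper is hypothetical and only there for illustration.

```perl
use strict;
use warnings;
use utf8;                            # the regex literal is UTF-8 source text
use Encode qw(encode);

my $reValidLine = qr/^[├│└]/;
my $bytes = encode('UTF-8', "│   ├── [dr\n");

# Hypothetical helper: read the same bytes through a given PerlIO layer
# and report the first code point plus whether the regex matched.
sub probe {
    my ($layer) = @_;
    my $copy = $bytes;
    open my $fh, "<$layer", \$copy or die "cannot open in-memory handle: $!";
    my $line = <$fh>;
    close $fh;
    return {
        first   => ord(substr($line, 0, 1)),
        matched => ($line =~ $reValidLine ? 1 : 0),
    };
}

for my $layer (':raw', ':utf8') {
    my $r = probe($layer);
    printf "%-5s first code point U+%04X, regex %s\n",
        $layer, $r->{first}, $r->{matched} ? 'matches' : 'does not match';
}
```

Under ':raw' the line starts with the single byte 0xE2, which is not in the character class; under ':utf8' it starts with U+2502, which is.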

Upvotes: 1
