Unicode error only when code run with -n flag at the command-line

Question

The following simple script (basically) slurps in its input, splits it according to a regexp, replaces every newline remaining in each element of the resulting list with a literal \n, and prints out the modified elements one by one:

# demo.pl
use strict;
use utf8;
use open qw(:std :utf8);
use warnings qw(FATAL utf8);

BEGIN { $/ = $\ = undef; }

while ( <> ) {
  s/\n\z//;
  s/\n/\\n/g, print "$_\n" for split /\n(?=[^\W\d]\w*=)/;
}

When given an input file (INPUTFILE) with the following (UTF8-encoded) contents as its argument

A=42
ΦΡΩΒΩΖΖ=ABCDEFGHIJKLMNOPQRSTUVWXYZ
_B_C_D12=
foo
345bar=nope
baz
  =whatever=
X_Y_Z=quux

...it prints out the desired output, namely:

% perl demo.pl INPUTFILE
A=42
ΦΡΩΒΩΖΖ=ABCDEFGHIJKLMNOPQRSTUVWXYZ
_B_C_D12=\nfoo\n345bar=nope\nbaz\n  =whatever=
X_Y_Z=quux

In contrast, the following almost identical CLI one-liner

% perl -ne 'use strict; use utf8; use open qw(:std :utf8); use warnings qw(FATAL utf8); BEGIN { $/ = $\ = undef; } s/\n\z//; s/\n/\\n/g, print "$_\n" for split /\n(?=[^\W\d]\w*=)/;' INPUTFILE

...produces the following for the same input file

A=42\nÎ¦Î¡Î©ÎÎ©ÎÎ=ABCDEFGHIJKLMNOPQRSTUVWXYZ
_B_C_D12=\nfoo\n345bar=nope\nbaz\n  =whatever=
X_Y_Z=quux

There are (apparently) two problems here:

the regular expression fails to separate the first and second items;
the output contains an illegible substring.

(I expect that both problems will have the same underlying cause.)

The only difference between the "in-file script" (demo.pl) and the CLI one-liner is that the former explicitly wraps the body of the script with while ( <> ) { ... }, whereas for the latter, the -n flag causes this wrapper to be inserted automatically.

Q: How must the one-liner above be modified so that it produces the desired result with the -n flag?

BTW, not surprisingly, the exact command-line equivalent of demo.pl (without the -n flag), namely

% perl -e 'use strict; use utf8; use open qw(:std :utf8); use warnings qw(FATAL utf8); BEGIN { $/ = $\ = undef; } while ( <> ) { s/\n\z//; s/\n/\\n/g, print "$_\n" for split /\n(?=[^\W\d]\w*=)/; }' INPUTFILE

also produces the desired output.

So the problem, whatever it is, has something to do with the -n flag.

FWIW:

% perl -v | head -2

This is perl 5, version 20, subversion 2 (v5.20.2) built for x86_64-linux-gnu-thread-multi

EDIT: One more clue: if the input to the failing one-liner is passed through STDIN rather than as a filename in @ARGV (e.g. replace INPUTFILE with < INPUTFILE), then it produces ~~the desired output~~ fully legible, though still incorrect, output:

A=42\nΦΡΩΒΩΖΖ=ABCDEFGHIJKLMNOPQRSTUVWXYZ
_B_C_D12=\nfoo\n345bar=nope\nbaz\n  =whatever=
X_Y_Z=quux

My current guess is that use open qw(:std :utf8) does not cover the input stream that <> reads from when the input is passed as a filename in @ARGV.

Michael Homer · Accepted Answer

> The only difference between the "in-file script" (demo.pl) and the CLI one-liner is that the former explicitly wraps the body of the script with while ( <> ) { ... }, whereas for the latter, the -n flag causes this wrapper to be inserted automatically.

Yes, exactly: -n wraps all the code in while (<>) { ... }. That includes your use utf8; and use open qw(:std :utf8); lines, so the file is already open by the time you enable Unicode.

You can easily validate this by running the equivalent program to the -n version:

while (<>) {
    # demo.pl
    use strict;
    use utf8;
    use open qw(:std :utf8);
    use warnings qw(FATAL utf8);
    BEGIN { $/ = $\ = undef; }
    s/\n\z//;
    s/\n/\\n/g, print "$_\n" for split /\n(?=[^\W\d]\w*=)/;
}

and seeing the same effect.
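
Another way to look at the wrapper, assuming the core B::Deparse module is installed alongside your perl, is to ask Perl to print back the program it compiled. This is only a sketch: the exact text of the loop header differs between Perl versions, and the `wrapped` variable name is just for illustration.

```shell
#!/bin/sh
# B::Deparse prints Perl's own reconstruction of the compiled program,
# including the loop that -n adds around the -e code.
# The "-e syntax OK" notice goes to stderr, so it is discarded here.
wrapped=$(perl -MO=Deparse -ne 'use utf8; print' 2>/dev/null)

# Show the reconstructed source; the "use utf8;" line should appear
# inside the loop body rather than before it.
printf '%s\n' "$wrapped"
```

The printed code should show a while loop reading from ARGV with the use line inside its body, which is the arrangement described above.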

More interestingly, you can see that the use declarations do still have an effect: run the exact same input file through twice

perl demo.pl INPUTFILE INPUTFILE

and you get two outputs, the first one broken, the second one correct. That also happens with your one-liner.

You can enable UTF-8 for input by default using the -C flag with the i (8) option:

perl -CiO -ne 'use strict; use utf8; BEGIN { $/ = $\ = undef; } s/\n\z//; s/\n/\\n/g, print "$_\n" for split /\n(?=[^\W\d]\w*=)/;' INPUTFILE

That ensures that UTF-8 is enabled before the file is opened, and you get the correct output. The O enables UTF-8 for standard output as well, so that you can print it.
