Encode::Guess:guess_encoding gives different results in different contexts

Question

I have the following sub that opens a text file and attempts to ensure its encoding is one of either UTF-8, ISO-8859-15 or ASCII.

The problem I have with it is different behaviours in interactive vs. non-interactive use.

when I run interactively with a file that contains a UTF-8 line, $decoder is, as expected, a reference object whose name returns utf8 for that line.
non-interactively (as it runs as part of a subversion commit hook) guess_encoding returns a scalar string of value utf8 or iso-8859-15 for the utf8 check line, and iso-8859-15 or utf8 for the other two lines.

I can't for the life of me, work out where the difference in behaviour is coming from. If I force the encoding of the open to say <:encoding(utf8), it accepts every line as UTF-8 without question.

The problem is I can't assume that every file it receives will be UTF-8, so I don't want to force the encoding as a work-around. Another potential workaround is to parse the scalar text, but that just seems messy, especially when it seems to work correctly in an interactive context.

From the shell, I've tried overriding $LANG (as non-interactively that isn't set, nor are any of the LC_ variables), however the interactive version still runs correctly.

The commented out line that reports $Encode::Guess::NoUTFAutoGuess returns 0 in both interactive and non-interactive use when commented in.

Ultimately, the one thing we're trying to prevent is having UTF-16 or other wide-char encodings in our repository (as some of our tooling doesn't play well with it): I thought that looking for a white-list of encodings is an easier job than looking for a black-list of encodings.

sub checkEncoding
{
    my ($file) = @_;

    my ($b1, $b2, $b3);
    my $encoding = "";
    my $retval = 1;
    my $line = 0;

    say("Checking encoding of $file");
    #say($Encode::Guess::NoUTFAutoGuess);
    open (GREPFILE, "<", $file);
    while () {
            chomp($_);
            $line++;

            my $decoder = Encode::Guess::guess_encoding($_, 'utf8');
            say("A: $decoder");
            $decoder = Encode::Guess::guess_encoding($_, 'iso-8859-15') unless ref $decoder;
            say("B: $decoder");
            $decoder = Encode::Guess::guess_encoding($_, 'ascii') unless ref $decoder;
            say("C: $decoder");

            if (ref $decoder) {
                    $encoding = $decoder->name;
            } else {
                    say "Mis-identified encoding '$decoder' on line $line: [$_]";
                    my $z = unpack('H*', $_);
                    say $z;
                    $encoding = $decoder;
                    $retval = 0;
            }

            last if ($retval == 0);
    }
    close GREPFILE;

    return $retval;
}

Encode::Guess:guess_encoding gives different results in different contexts

Answers (1)

Related Questions