Reputation: 31316
I have the following sub that opens a text file and attempts to ensure its encoding is one of UTF-8, ISO-8859-15 or ASCII.
The problem I have with it is different behaviours in interactive vs. non-interactive use.
When I run interactively with a file that contains a UTF-8 line, $decoder is, as expected, a reference object whose name method returns utf8 for that line.
Non-interactively (it runs as part of a Subversion commit hook), guess_encoding returns a scalar string of value "utf8 or iso-8859-15" for the utf8 check line, and "iso-8859-15 or utf8" for the other two lines.
I can't, for the life of me, work out where the difference in behaviour comes from. If I force the encoding of the open to, say, <:encoding(utf8), it accepts every line as UTF-8 without question.
The problem is I can't assume that every file it receives will be UTF-8, so I don't want to force the encoding as a work-around. Another potential workaround is to parse the scalar text, but that just seems messy, especially when it seems to work correctly in an interactive context.
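For reference, the forced-encoding work-around looks roughly like this (a sketch only, using a lexical filehandle rather than the bareword one in the sub below):

open(my $fh, '<:encoding(utf8)', $file)
    or die "Cannot open '$file': $!";
while (my $line = <$fh>) {
    # $line now holds decoded characters rather than raw bytes.
    ...
}
close($fh);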
From the shell, I've tried overriding $LANG (as non-interactively that isn't set, nor are any of the LC_ variables); however, the interactive version still runs correctly.
When uncommented, the line that reports $Encode::Guess::NoUTFAutoGuess prints 0 in both interactive and non-interactive use.
Ultimately, the one thing we're trying to prevent is having UTF-16 or other wide-character encodings in our repository (as some of our tooling doesn't play well with them): I thought that checking against a whitelist of acceptable encodings would be an easier job than checking against a blacklist of bad ones.
sub checkEncoding
{
    my ($file) = @_;
    my ($b1, $b2, $b3);
    my $encoding = "";
    my $retval = 1;
    my $line = 0;

    say("Checking encoding of $file");
    #say($Encode::Guess::NoUTFAutoGuess);

    open (GREPFILE, "<", $file);
    while (<GREPFILE>) {
        chomp($_);
        $line++;

        my $decoder = Encode::Guess::guess_encoding($_, 'utf8');
        say("A: $decoder");
        $decoder = Encode::Guess::guess_encoding($_, 'iso-8859-15') unless ref $decoder;
        say("B: $decoder");
        $decoder = Encode::Guess::guess_encoding($_, 'ascii') unless ref $decoder;
        say("C: $decoder");

        if (ref $decoder) {
            $encoding = $decoder->name;
        } else {
            say "Mis-identified encoding '$decoder' on line $line: [$_]";
            my $z = unpack('H*', $_);
            say $z;
            $encoding = $decoder;
            $retval = 0;
        }
        last if ($retval == 0);
    }
    close GREPFILE;
    return $retval;
}
Upvotes: 1
Views: 353
Reputation: 386331
No need to guess. For the specific options of UTF-8, ISO-8859-1 and US-ASCII, you can use Encoding::FixLatin's fix_latin. It's virtually guaranteed to succeed.
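A minimal usage sketch (assuming the file's bytes are already in $bytes; fix_latin is exported on request):

use Encoding::FixLatin qw( fix_latin );

# Takes a byte string that may mix UTF-8, ISO-8859-1/CP1252 and
# US-ASCII, and returns a string of decoded Perl characters.
my $text = fix_latin($bytes);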
That said, I think the use of ISO-8859-15 in the OP is a typo for ISO-8859-1.
The method used by fix_latin would work just as well for ISO-8859-15 as it does for ISO-8859-1. It's simply a question of replacing _init_byte_map with the following:
sub _init_byte_map {
    foreach my $i (0x80..0xFF) {
        my $byte = chr($i);
        # from_to() converts in place and returns a byte count, so
        # convert a copy and keep the original byte as the map key.
        my $utf8 = $byte;
        Encode::from_to($utf8, 'iso-8859-15', 'UTF-8');
        $byte_map->{$byte} = $utf8;
    }
}
Alternatively, if you're willing to assume the data is all of one encoding or another (rather than a mix), you could also use the following approach:
my $text;
if (!eval {
    $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
    1  # No exception
}) {
    $text = decode("ISO-8859-15", $bytes);
}
Keep in mind that US-ASCII is a proper subset of both UTF-8 and ISO-8859-15, so it doesn't need to be handled specially.
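Putting it together for a per-file check, a hypothetical helper (not part of the above) could slurp the file as raw bytes and apply the UTF-8-first fallback:

use Encode qw( decode );

sub read_text_file {
    my ($file) = @_;

    open(my $fh, '<:raw', $file)
        or die("Cannot open '$file': $!");
    my $bytes = do { local $/; <$fh> };
    close($fh);

    my $text;
    if (!eval {
        $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
        1  # No exception
    }) {
        # Not valid UTF-8, so treat the bytes as ISO-8859-15
        # (which can't fail; every byte sequence is valid).
        $text = decode("ISO-8859-15", $bytes);
    }

    return $text;
}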
Upvotes: 1