ynka

Reputation: 1497

How to clear non-utf characters while reading a utf-8 file in Perl?

I am parsing a very large log file with Perl. The code is:

open($input_handle, '<:encoding(UTF-8)', $input_file);

while (<$input_handle>) {
...
}
close($input_handle);

However, sometimes the log file contains faulty characters, and I get the following message:

utf8 "\xD0" does not map to Unicode at log_parser.pl line 32, <$input_handle> line 10920.

I am aware of the characters and I would just like to ignore them without the log message flooding my (Windows!) build server logs. I tried no warnings 'utf8'; but it did not help.

How can I suppress the message?

Upvotes: 1

Views: 1411

Answers (1)

ikegami

Reputation: 385546

You could do the decoding yourself instead of using the :encoding layer. By default, Encode's decode and decode_utf8 silently replace each malformed byte sequence with U+FFFD (the replacement character) rather than warning.

$ perl -e'
   use Encode qw( decode_utf8 );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = decode_utf8($bytes);
   printf("U+%v04X\n", $text);
'
U+FFFD.0020.FFFD.0020.0412.000A
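
Applied to the file-reading loop from the question, that could look roughly like the sketch below. It opens the file with the :raw layer and decodes each line by hand. The helper name each_decoded_line and the callback interface are made up for illustration; they are not part of Encode.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper (not part of Encode): read $path as raw bytes and
# pass each decoded line to $callback. Malformed UTF-8 becomes U+FFFD
# because decode() is called without a CHECK argument (FB_DEFAULT).
sub each_decoded_line {
    my ($path, $callback) = @_;

    # :raw means no decoding layer, so nothing warns while reading.
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";

    while (my $bytes = <$fh>) {
        my $text = decode('UTF-8', $bytes);
        $callback->($text);
    }

    close($fh);
    return;
}

# Usage sketch (the file name is an assumption):
#   each_decoded_line('build.log', sub {
#       my ($line) = @_;
#       $line =~ tr/\x{FFFD}//d;   # optional: drop the replacement characters
#       ...
#   });
```

Splitting on "\n" before decoding is safe for UTF-8, since the byte 0x0A never occurs inside a multi-byte sequence. Note that :raw also turns off CRLF translation on Windows, so lines may end in "\r\n" and need their own handling.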

If the file is a mix of UTF-8, iso-8859-1 and cp1252, it may be possible to fix the file rather than simply silencing the errors, as detailed here.

Upvotes: 3
