Reputation: 275
I have an old set of CSV files which were created using incompatible encodings, including UTF-8 and ISO 8859-2. Now I am importing them into the database, and of course I would like, say, "krzesło" to be recognised as such regardless of the original encoding. If they were all UTF-8 files, it would be straightforward: I have already found the Text::CSV and Text::CSV::Encoded modules, and for UTF-8 files it all worked in a snap.
The issue is that some files use the 8-bit ISO 8859-2 encoding, and if I blindly replace the characters with their UTF-8 representation, I may corrupt lines that were already encoded in UTF-8.
I thought about identifying the encoding at the file level and converting the files before importing them, but the files are not mine: I still receive new data, and I cannot be sure that future files will all be UTF-8 encoded.
A general algorithm of my program is as follows:
use utf8;
use Encode qw(encode decode);
use open ':std', ':encoding(UTF-8)';
use Text::CSV::Encoded;

my $csv = Text::CSV::Encoded->new(
    {
        encoding_in  => "utf8",
        encoding_out => "utf8",
        binary       => 1,        # should set binary attribute
        sep_char     => ';',
        eol          => $/,
    }
) or die "Cannot use CSV: " . Text::CSV->error_diag();

while (<>) {
    # if the line is not UTF-8, convert it to UTF-8 -- this is the step I am missing
    if ($csv->parse($_)) {
        #
        # further field-level processing
        #
    }
}
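A minimal sketch of one way the "convert to UTF-8" step could look, assuming every input line is either valid UTF-8 or ISO 8859-2 (the helper name to_utf8 is mine, not from any module): try a strict UTF-8 decode first, and fall back to ISO 8859-2 when it fails.

use Encode qw(decode);

sub to_utf8 {
    my ($bytes) = @_;
    # Work on a copy, since a strict decode may modify its argument.
    my $copy = $bytes;
    # A strict UTF-8 decode dies on invalid byte sequences.
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;
    # Every byte sequence is valid ISO 8859-2, so this always succeeds.
    return decode('iso-8859-2', $bytes);
}

The returned value is a decoded character string; it could be re-encoded to UTF-8 bytes before being handed to Text::CSV::Encoded, or parsed with plain Text::CSV instead, since the decoding has already been done.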
Upvotes: 3
Views: 226
Reputation: 6378
You could try Encode::Detective. It can be used in a one-liner as follows:
perl -00 -MEncode::Detective=detect -E 'open my $fh, "<", "file.csv"; my $content = <$fh>; $enc = detect($content); say $enc'
It should not be too difficult to fit that into your script.
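For the import script itself, a rough sketch of how detect could be used per file, detecting the encoding from the raw bytes and then reopening the file through the matching I/O layer (the helper open_csv and the error handling are only illustrative):

use Encode::Detective qw(detect);

sub open_csv {
    my ($path) = @_;
    # Read the raw bytes so the detector sees undecoded data.
    open my $raw, '<:raw', $path or die "$path: $!";
    my $bytes = do { local $/; <$raw> };
    close $raw;

    my $enc = detect($bytes)
        or die "could not detect the encoding of $path";

    # Reopen with the detected encoding; lines then arrive as decoded text.
    open my $fh, "<:encoding($enc)", $path or die "$path: $!";
    return $fh;
}

Lines read from such a handle are already decoded text, so they could be parsed with plain Text::CSV rather than being decoded a second time.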
Upvotes: 2