Jasio

Reputation: 275

How to automatically recognise the encoding of an input stream [CSV files]

I have an old set of CSV files that were created with incompatible encodings, including UTF-8 and ISO 8859-2. I am now importing them into a database, and I would like a word such as "krzesło" to be recognised correctly regardless of the original encoding. If they were all UTF-8 files, it would be straightforward: I have already found the Text::CSV and Text::CSV::Encoded modules, and for UTF-8 files everything worked like a charm.

The issue is that some files use the 8-bit ISO 8859-2 encoding, and if I blindly replace characters with their UTF-8 representation, I may corrupt lines that were already UTF-8 encoded.

I thought about identifying the encoding at the file level and converting the files before importing them, but the files are not mine: I still receive new data, and there is no guarantee that future files will all be UTF-8 encoded.

A general algorithm of my program is as follows:

use utf8;
use Encode qw(encode decode);
use open ':std', ':encoding(UTF-8)';

my $csv = Text::CSV::Encoded->new({
    encoding_in  => 'utf8',
    encoding_out => 'utf8',
    binary       => 1,        # must be set to handle non-ASCII data
    sep_char     => ';',
    eol          => $/,
}) or die 'Cannot use CSV: ' . Text::CSV->error_diag();

while (<>) {
    # pseudocode: if ($_ is not valid UTF-8) { convert $_ from ISO 8859-2 to UTF-8 }
    if ($csv->parse($_)) {
        #
        # further field-level processing
        #
    }
}
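One common way to flesh out the pseudocode line, sketched here under the assumption that the only two encodings in play are UTF-8 and ISO 8859-2 (the helper name `to_text` is mine): attempt a strict UTF-8 decode first, and fall back to ISO 8859-2 only when the bytes are not valid UTF-8. Note that this requires reading the input as raw bytes (so the `:encoding(UTF-8)` layer on STDIN would have to go), and that short ISO 8859-2 sequences can occasionally form valid UTF-8 by accident, so a per-file decision is safer than a per-line one.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn a raw byte string into a Perl text string.
# A strict UTF-8 decode croaks on malformed input, so a failure here
# means the bytes cannot be UTF-8, and we retry as ISO 8859-2
# (an 8-bit encoding in which every byte sequence is decodable).
sub to_text {
    my ($bytes) = @_;
    my $text = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC)
    };
    return defined $text ? $text : Encode::decode('iso-8859-2', $bytes);
}

# "krzesło" survives either way:
#   UTF-8 bytes:      6b 72 7a 65 73 c5 82 6f
#   ISO 8859-2 bytes: 6b 72 7a 65 73 b3 6f
```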

Upvotes: 3

Views: 226

Answers (1)

G. Cito

Reputation: 6378

You could try Encode::Detective. It can be used as follows in a one-liner:

perl -00 -MEncode::Detective=detect -E 'open my $fh, "<", "file.csv";
  my $content = <$fh>; my $enc = detect($content); say $enc'

It should not be too difficult to fit that into your script.
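One way to fit it into the import script, sketched as a small helper (the name `decode_csv_content` is mine, and falling back to `utf8` when detection is inconclusive is my assumption, not documented behaviour): slurp each file raw, detect its encoding once, then decode the whole content before splitting it into lines for the CSV parser.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a whole file's raw contents using a detector callback
# (e.g. \&Encode::Detective::detect from the CPAN module above).
# Detecting once per file avoids per-line misfires on short lines.
sub decode_csv_content {
    my ($content, $detector) = @_;
    my $enc = $detector->($content) // 'utf8';   # assumed fallback
    return ($enc, decode($enc, $content));
}

# In the real script, roughly:
#   open my $fh, '<:raw', 'file.csv' or die $!;
#   my $content = do { local $/; <$fh> };
#   my ($enc, $text) = decode_csv_content($content, \&Encode::Detective::detect);
#   $csv->parse($_) for split /\r?\n/, $text;
```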

Upvotes: 2
