Perl: how to detect corrupt data in a CSV file?

Question

I download a CSV file from another server using a Perl script. After the download I want to check whether the file contains any corrupted data. I tried to use Encode::Detect::Detector to detect the encoding, but it returns undef in both of these cases:

if the string is ASCII or
if the string is corrupted

So using the below program I can't differentiate between ASCII & Corrupted Data.

 use strict;
 use Text::CSV;
 use Encode::Detect::Detector;
 use XML::Simple;
 use Encode;
 require Encode::Detect;

 my @rows;
 my $init_file = "new-data-jp-2013-8-8.csv";



 my $csv = Text::CSV->new ( { binary => 1 } )
                 or die "Cannot use CSV: ".Text::CSV->error_diag ();

 open my $fh, $init_file or die $init_file.": $!";

 while ( my $row = $csv->getline( $fh ) ) {
     my @fields = @$row; # get line into array
     for (my $i=1; $i<=23; $i++){  # I already know that CSV file has 23 columns
            if ((Encode::Detect::Detector::detect($fields[$i-1])) eq undef){
                print "the encoding is undef in col".$i.
                            "  where field is ".$fields[$i-1].
                            " and its length is  ".length($fields[$i-1])."\n";
            }
            else {
            my $string = decode("Detect", $fields[$i-1]);
            print "this is string print  ".$string.
                    " the encoding is ".Encode::Detect::Detector::detect($fields[$i-1]).
                    " and its length is  ".length($fields[$i-1])."\n";
            }
        }   
     }

amon · Accepted Answer

You have some bad assumptions about encodings, and some errors in your script.

foo() eq undef

does not make any sense. You cannot test for string equality against undef, as undef isn't a string. It does, however, stringify to the empty string. You should enable warnings (use warnings;) to get a diagnostic when you do such rubbish. To test whether a value is undef, use defined:

unless(defined foo()) { .... }
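A minimal sketch of the defined check. The maybe_detect stub and its 'unknown-8bit' return value are hypothetical stand-ins for a real detector, used only so the example runs with core Perl:

```perl
use strict;
use warnings;

# Hypothetical stand-in for an encoding detector: returns a charset
# label for input containing high bytes, and undef for anything else.
sub maybe_detect {
    my ($octets) = @_;
    return $octets =~ /[^\x00-\x7F]/ ? 'unknown-8bit' : undef;
}

# Classify the result with defined(), which never stringifies undef
# and therefore never triggers an "uninitialized value" warning.
sub describe {
    my ($octets) = @_;
    my $charset = maybe_detect($octets);
    return defined $charset ? "detected: $charset" : "no result";
}

print describe("plain ascii"), "\n";
print describe("caf\xE9"), "\n";
```

Comparing with eq would treat undef and the empty string as the same thing; defined keeps the two apart.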

The Encode::Detect::Detector module uses an object-oriented interface. Therefore,

Encode::Detect::Detector::detect($foo)

is wrong. According to the docs, you should be doing

Encode::Detect::Detector->detect($foo)

You probably cannot do decoding on a field-by-field basis. Usually, one document has one encoding. You need to specify the encoding when opening the file handle, e.g.

use autodie;
open my $fh, "<:utf8", $init_file;
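If the file is supposed to be UTF-8 (an assumption; the question does not say which encoding the server sends), one way to flag corrupt data is a strict decode that dies on malformed bytes. A sketch using only the core Encode module; the helper name is_valid_utf8 is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $octets is well-formed UTF-8, else 0.
# Assumes the expected encoding is UTF-8; swap in another name if not.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # decode() with a CHECK value may modify its argument
    # FB_CROAK makes decode() die on the first malformed sequence.
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

print is_valid_utf8("caf\xC3\xA9") ? "ok\n" : "malformed\n";   # well-formed UTF-8
print is_valid_utf8("caf\xE9")     ? "ok\n" : "malformed\n";   # lone Latin-1 byte
```

The same check could be applied to each raw line before it is handed to the CSV parser. Note that the :utf8 layer does not validate its input, whereas :encoding(UTF-8) does.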

While CSV can support some degree of binary data (like encoded text), it isn't well suited for this purpose, and you may want to choose another data format.

Finally, ASCII data effectively does not need any decoding or encoding. The undef result for encoding detection does make sense here. It cannot be asserted with certainty that a document was encoded as ASCII (as many encodings are a superset of ASCII), but for a given document it can be asserted that it isn't valid ASCII (i.e. it contains a byte with the 8th bit set) and must therefore use a more complex encoding like Latin-1 or UTF-8.
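The "8th bit set" test can be written as a single regex over the raw bytes. A small sketch, with is_ascii as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if every byte is in the range
# 0x00-0x7F (8th bit clear), i.e. the string is pure ASCII, else 0.
sub is_ascii {
    my ($octets) = @_;
    return $octets !~ /[^\x00-\x7F]/ ? 1 : 0;
}

print is_ascii("hello,world") ? "ascii\n" : "not ascii\n";
print is_ascii("caf\xE9")     ? "ascii\n" : "not ascii\n";
```

A field that passes this check needs no detection at all, so an undef from the detector can be told apart from a field that contains high bytes the detector could not classify.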
