jdamae

Reputation: 3909

Parsing out a string in a CSV file using Perl

I am using Perl's Tie::File to parse a .csv file and match a specific string. It's actually the first string (the header) in the file.

The problem might be my input file type. The tool that exports the data file can export as .csv or as text; I have tried and tested both.

Somehow, I am still NOT getting the match. My problem could be two-fold: (1) my regex is wrong and/or (2) the file type.

Sample file header/string (if I cat the file):

??Global  Mail_Date.Dat

Sample file header/string (if I open it in an editor, Apple's TextEdit.app):

Global  Mail_Date.Dat

Here's the octal dump:

0000000 377 376   G  \0   l  \0   o  \0   b  \0   a  \0   l  \0      \0
        feff 0047 006c 006f 0062 0061 006c 0020
0000020      \0   M  \0   a  \0   i  \0   l  \0   _  \0   D  \0   a  \0
        0020 004d 0061 0069 006c 005f 0044 0061
0000040   t  \0   e  \0   .  \0   D  \0   a  \0   t  \0  \r  \0  \n  \0
        0074 0065 002e 0044 0061 0074 000d 000a

As you can see, running cat on the file shows a leading ?? before the string.

Code:

use strict;
use warnings;
use Tie::File;
use File::Copy;

    for (@ARGV) {
        tie my @lines, "Tie::File", $_;             
        #shift @lines if $lines[0] =~ /^Global/;
        if ($lines[0] =~ /^Global/) 
        {
             print "We have a match, remove the line ..";
             #shift @lines if $lines[0] =~ /^Global/;
             untie @lines; 
        }
        else
        { 
             print "No match found. Exit";
        }

}

Upvotes: 2

Views: 564

Answers (2)

David W.

Reputation: 107040

Looking at the octal dump, I notice a null character between each of your regular characters. That is, it's G-\0-l-\0-o-\0-b-\0-a-\0-l-\0 and not G-l-o-b-a-l. This means your file is not ASCII text. Is it UTF-16? The two bytes per character and the leading 377 376 (FF FE) byte order mark suggest so. If it is, you have to specify an encoding layer when you open the file in Perl:

open(my $fh, "<:encoding(UTF-16)", $fileName)
    or die qq(Can't open file "$fileName" for reading: $!);

If this is a CSV file, you should try the Text::CSV::Encoded module, which handles the encoding while it parses the CSV for you.
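To tie this back to the header check in the question, here is a minimal sketch, assuming the file really is UTF-16 with a byte order mark as the dump suggests. The helper name first_line_utf16 is made up for illustration. With a BOM present, the UTF-16 layer works out the byte order and drops the BOM, so the decoded line should start with "Global" directly:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line of a UTF-16 file, decoded,
# with its line ending removed. Assumes the file starts with a BOM.
sub first_line_utf16 {
    my ($path) = @_;

    # The UTF-16 layer reads the BOM to pick the byte order, then discards it.
    open(my $fh, '<:encoding(UTF-16)', $path)
        or die qq(Can't open file "$path" for reading: $!);

    my $line = <$fh>;
    close $fh;

    return undef unless defined $line;   # empty file
    $line =~ s/\r?\n\z//;                # strip CRLF or LF
    return $line;
}
```

You could then test the header with something like `print "match\n" if first_line_utf16($ARGV[0]) =~ /^Global/;`.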

Upvotes: 1

ErikR

Reputation: 52029

It looks like your file is encoded in UTF-16. The leading bytes 377 376 (FF FE) in your dump are a little-endian byte order mark (BOM).

Try something like this:

binmode STDIN, ':encoding(UTF-16LE)';
while (<STDIN>) {
  if (m/Global/) {  # see note
    print "Matched Global on line $.\n";
  }
}

If you get a match then at least we know the encoding is correct.

To compensate for the BOM code-point, you could read in a single character after the binmode call:

binmode STDIN, ':encoding(UTF-16LE)';
read(STDIN, my $buf, 1);
while (<STDIN>) {
  if (m/^Global/) { ... }
}
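Reading one character unconditionally would also swallow the "G" if a file ever arrived without a BOM. As an alternative sketch (strip_bom is a made-up helper name, not part of any module), you could remove the BOM from the decoded first line only when it is actually there:

```perl
use strict;
use warnings;

# Hypothetical helper: drop a leading BOM (U+FEFF) from an already-decoded
# string. A string without a BOM is returned unchanged.
sub strip_bom {
    my ($text) = @_;
    $text =~ s/\A\x{FEFF}//;
    return $text;
}
```

Applied to the first line read from the UTF-16LE handle, the result can then be matched with /^Global/ as before.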

Upvotes: 1
