Reputation: 3909
I am using Perl's Tie::File to parse a .csv file and match a specific string; it's actually the first string (the header) in the file.
The problem might be my input file type. The tool that exports the data file can export as .csv or as text, and I have tried both.
Somehow, I am still NOT getting the match. My problem could be two-fold: (1) my regex is wrong and/or (2) the file type.
Sample file header/string (if I cat the file):
??Global Mail_Date.Dat
Sample file header/string (if I open it in an editor, Apple's TextEdit.app):
Global Mail_Date.Dat
Here's the octal dump:
0000000 377 376 G \0 l \0 o \0 b \0 a \0 l \0 \0
feff 0047 006c 006f 0062 0061 006c 0020
0000020 \0 M \0 a \0 i \0 l \0 _ \0 D \0 a \0
0020 004d 0061 0069 006c 005f 0044 0061
0000040 t \0 e \0 . \0 D \0 a \0 t \0 \r \0 \n \0
0074 0065 002e 0044 0061 0074 000d 000a
Obviously, running cat from the shell shows a leading ?? on the string.
Code:
use strict;
use warnings;
use Tie::File;
use File::Copy;
for (@ARGV) {
tie my @lines, "Tie::File", $_;
#shift @lines if $lines[0] =~ /^Global/;
if ($lines[0] =~ /^Global/)
{
print "We have a match, remove the line ..";
#shift @lines if $lines[0] =~ /^Global/;
untie @lines;
}
else
{
print "No match found. Exit";
}
}
Upvotes: 2
Views: 564
Reputation: 107040
I'm looking at the octal dump and notice a null character between each of your regular characters. That is, it's G-\0-l-\0-o-\0-b-\0-a-\0-l-\0 and not G-l-o-b-a-l. This means your file is not ASCII text. Is this UTF-8 or UTF-16? The leading bytes 377 376 (0xFF 0xFE) look like a little-endian UTF-16 byte order mark. If so, you have to specify an encoding layer when you open the file in Perl:
open(my $fh, "<:encoding(UTF-16)", $fileName)
or die qq(Can't open file "$fileName" for reading: $!);
If this is a csv file, you should try the Text::CSV::Encoded module. This will help you parse your CSV file.
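As a rough, self-contained sketch of the suggestion above: the snippet below writes a small sample file shaped like the one in the question (BOM followed by UTF-16LE text with CRLF line endings), then reopens it through the UTF-16 encoding layer and tests the first line. The helper name first_line_starts_with and the temporary file are assumptions made for illustration only; they are not part of the original code. With the generic UTF-16 layer, the decoder uses the BOM to pick the byte order, so the decoded first line begins with "Global".

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the original post): returns 1 if the first
# line of a BOM-prefixed UTF-16 file starts with the given literal prefix.
sub first_line_starts_with {
    my ($path, $prefix) = @_;
    open(my $fh, '<:encoding(UTF-16)', $path)
        or die qq(Can't open file "$path" for reading: $!);
    my $first = <$fh>;
    close $fh;
    return 0 unless defined $first;
    $first =~ s/\r?\n\z//;    # strip a CRLF or LF line ending
    return $first =~ /^\Q$prefix\E/ ? 1 : 0;
}

# Build a sample file like the one in the question: BOM + UTF-16LE text.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw:encoding(UTF-16LE)';
print {$tmp_fh} "\x{FEFF}Global Mail_Date.Dat\r\nsecond line\r\n";
close $tmp_fh;

print first_line_starts_with($tmp_path, 'Global') ? "match\n" : "no match\n";
```

If the real file has no BOM, the generic UTF-16 layer cannot guess the byte order, and an explicit UTF-16LE or UTF-16BE layer would be needed instead.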
Upvotes: 1
Reputation: 52029
It looks like your file is encoded in UTF-16.
Try something like this:
binmode STDIN, ':encoding(UTF-16LE)';
while (<STDIN>) {
if (m/Global/) { # not anchored with ^ because the BOM comes first
print "Matched Global on line $.\n";
}
}
If you get a match then at least we know the encoding is correct.
To compensate for the BOM code point, you could read in a single character after the binmode call:
binmode STDIN, ':encoding(UTF-16LE)';
read(STDIN, my $buf, 1);
while (<STDIN>) {
if (m/^Global/) { ... }
}
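An alternative to reading one character up front is to strip the BOM code point from the first line with a substitution, which also works when the BOM is absent. The sketch below is only an illustration under assumptions: the helper name lines_without_header is made up, and an in-memory byte string stands in for STDIN so that the example is self-contained. It drops the first line when it starts with "Global", which is what the question is trying to do.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from the original post): reads already-decoded
# lines from $in, removes a leading BOM, and drops a "Global" header line.
sub lines_without_header {
    my ($in) = @_;
    my @lines = <$in>;
    return () unless @lines;
    $lines[0] =~ s/^\x{FEFF}//;              # remove the BOM code point, if present
    shift @lines if $lines[0] =~ /^Global/;  # drop the header line
    return @lines;
}

# Demo input: BOM bytes followed by UTF-16LE text, held in memory.
my $bytes = "\xFF\xFE" . encode('UTF-16LE', "Global Mail_Date.Dat\r\na,b,c\r\n");
open(my $in, '<:encoding(UTF-16LE)', \$bytes)
    or die "Can't open in-memory file: $!";
my @kept = lines_without_header($in);
close $in;

print scalar(@kept), " line(s) kept\n";
```

The kept lines still carry their CRLF endings, so they can be written back out unchanged through a UTF-16LE output layer.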
Upvotes: 1