Reputation: 13
I have a folder of several hundred text files. Each file has the same format; for instance, the file named ATextFile1.txt reads:
ATextFile1.txt 09 Oct 2013
1
2
3
4
...
I have a simplified Perl script that is supposed to read the file and print it back out in the terminal window:
#!/usr/bin/perl
use warnings;
use strict;

my $fileName = shift(@ARGV);
open(my $INFILE, "<:encoding(UTF-8)", $fileName) || die("Cannot open $fileName: $!.\n");
foreach (<$INFILE>) {
    print("$_"); # Uses the newline character from the file
}
When I use this script on files generated by the Windows version of the program that produces ATextFile1.txt, my output is exactly as I'd expect (the content of the text file). However, when I run this script on files generated by the Mac version of the same program, the output looks like the following:
2016tFile1.txt 09 Oct 2013
After some testing, it seems that it only prints the first line of the file, with the first four characters overwritten by something matching the regex /[0-9][0-9]16/. If I replace the output statement in my Perl script with print("\t$_");, I get the following line printed to STDOUT:
2016 ATextFile1.txt 09 Oct 2013
Each of these files can be read normally in any standard text editor, but for some reason my Perl script can't seem to read and print them properly. Any help would be greatly appreciated (I'm hoping it's something obvious that I'm missing). Thanks in advance!
Upvotes: 1
Views: 89
Reputation: 126732
Note that if you are printing UTF-8 characters to STDOUT, you will need to use
binmode STDOUT, ':encoding(utf8)';
beforehand.
It looks as if your Mac files have just CR as the line ending. I understood that recent versions of Macintosh systems use LF as the line ending (the same as Linux), while Mac OS 9 uses just CR. Windows uses the two-character sequence CR LF inside the file, which is converted to just LF by the PerlIO layer when perl is running on a Windows platform.
If there are no linefeeds in the file, then Perl will read the entire file as a single record, and printing it will overlay all the lines on top of one another: each bare CR sends the terminal cursor back to the start of the line without advancing it, so later lines overwrite the beginning of earlier, longer ones. That would explain why the first few characters of your header line appear to be clobbered by digits from the end of the file.
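If you want to confirm which line endings a given file actually uses, a minimal sketch along these lines should work (it assumes the file name is passed as the first argument; the variable names are arbitrary):
#!/usr/bin/perl
use strict;
use warnings;

# Open with the :raw layer so no PerlIO layer translates line endings
open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my $data = do { local $/; <$fh> };   # slurp the whole file

# Count each line-ending style separately
my $crlf = () = $data =~ /\r\n/g;        # Windows: CR LF
my $cr   = () = $data =~ /\r(?!\n)/g;    # old Mac: bare CR
my $lf   = () = $data =~ /(?<!\r)\n/g;   # Unix/Linux: bare LF

printf "CRLF: %d, bare CR: %d, bare LF: %d\n", $crlf, $cr, $lf;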
As long as the files are relatively small, the easiest way to read either file format with the same Perl code is to read the whole file and split it on either CR or LF. Anything else will need different code according to the source of the input files.
Try this version of your code.
use strict;
use warnings;

my @contents = do {
    open my $fh, '<:encoding(utf8)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    local $/;                      # undefine the record separator to slurp the file
    my $contents = <$fh>;
    split /[\r\n]+/, $contents;    # split on any run of CR and/or LF
};

print "$_\n" for @contents;
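Because local $/ undefines the input record separator, the whole file is read in one go, and splitting on /[\r\n]+/ then copes with CR, LF, or CRLF endings alike (note that it will also swallow blank lines). You would run it as, for example, perl script.pl ATextFile1.txt, where the script name is just a placeholder.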
Update
One alternative you might try is the PerlIO::eol module, which provides a PerlIO layer that translates any line ending to LF when a record is read. I'm not certain that it plays nicely with UTF-8, but as long as you add it after the encoding layer it should be fine.
It is not a core module so you will probably need to install it, but after that the program becomes just
use strict;
use warnings;

open my $fh, '<:encoding(UTF-8):eol(LF)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
binmode STDOUT, ':encoding(utf8)';

print while <$fh>;
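Installation is the usual CPAN routine, for example cpan PerlIO::eol, or cpanm PerlIO::eol if you have App::cpanminus available.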
I have created Windows, Linux, and Mac-style text files, and this program works fine with all of them, but I have been unable to check whether a UTF-8 character that has 0x0D or 0x0A as part of its encoding is passed through properly, so be careful.
Update 2
After thinking briefly about this, of course there are no UTF-8 encodings that contain CR or LF apart from those characters themselves. All characters outside the ASCII range are encoded using only bytes with the top bit set, so every byte is 0x80 or above and can never be 0x0D or 0x0A.
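A quick way to convince yourself of this is to dump the bytes of a non-ASCII character with the core Encode module; a small sketch (the choice of the euro sign is arbitrary):
use strict;
use warnings;
use Encode qw(encode);

# Encode a non-ASCII character and show its UTF-8 bytes. Every byte of
# a multi-byte sequence has the top bit set (>= 0x80), so none of them
# can collide with CR (0x0D) or LF (0x0A).
my $bytes = encode('UTF-8', "\x{20AC}");   # EURO SIGN
printf '%02X ', ord($_) for split //, $bytes;
print "\n";   # prints: E2 82 AC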
Upvotes: 3