Gregg Seipp

Reputation: 173

Need a script for stripping extra Line Feed characters from text files

I'm running Perl on Windows and I've got some text files whose lines end in CRLF (0d0a). Problem is, there are occasional stray 0a characters sprinkled around the file that are splitting lines in Windows Perl and mucking with my processing. My thought is to preprocess the file, reading lines split on CRLF only, but, at least on Windows, Perl insists on splitting on LF as well.

I've tried setting $/

local $/ = 0x0d; 
open(my $fh, "<", $file) or die "Unable to open $file";
while (my $line = <$fh>) {
    # do something to get rid of the 0x0a embedded in the line of text; 
}

...but this reads multiple lines...it seems to miss the 0x0d altogether. I've also tried setting it to "\n", "\n\r", "\r" and "\r\n". There must be a simple way to do this!

I need to get rid of these stray LFs so I can correctly process the file. So, I need a script that will open the file, split it on CRLF, find any 0a that isn't preceded by a 0d, blast it, and save the result, line by line, to a new file.

Thanks for any help you can provide.

Upvotes: 2

Views: 119

Answers (2)

Gregg Seipp

Reputation: 173

This solution works by reading the data in using binary mode.

open(my $INFILE, "<:raw", $infile)
    or die("Can't open \"$infile\": $!\n");
open(my $OUTFILE, ">:raw", $outfile)
    or die("Can't create \"$outfile\": $!\n");

my $buffer = '';
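# Append each chunk after whatever was held back from the previous one.
while (sysread($INFILE, $buffer, 4*1024*1024, length($buffer))) {
    $buffer =~ s/(?<!\x0D)\x0A//g;

    # Hold back a trailing CR in case we cut between a CR and a LF.
    my $keep = $buffer =~ /\x0D\z/ ? 1 : 0;
    print $OUTFILE substr($buffer, 0, length($buffer) - $keep, '');
}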

print $OUTFILE $buffer;

Upvotes: 2

ikegami

Reputation: 386331

For starters, local $/ = 0x0d; should be local $/ = "\x0d";.
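
To see why, here is a minimal sketch (the variable names are only for illustration): 0x0d is a numeric literal, so when it is used as a string it becomes the two characters "13" rather than a carriage return.

```perl
use strict;
use warnings;

# 0x0d is the number 13; used as a string it is the two characters "13".
my $as_number = 0x0d;
# "\x0d" is a one-character string holding the carriage return itself.
my $as_string = "\x0d";

printf "numeric literal as a string: '%s' (length %d)\n",
    "$as_number", length("$as_number");
printf "string escape: length %d, ordinal %d\n",
    length($as_string), ord($as_string);
```

So with local $/ = 0x0d; Perl would look for the text "13" as the record separator.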

Aside from that, the problem is that a :crlf layer is added to file handles in Windows by default. This causes CRLF to be converted to LF on read (and vice-versa on write). There are therefore no CR in what you read, so you end up reading the entire file.
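
The effect of the layer can be checked with a small sketch like the one below. It assumes a throwaway temp file created with File::Temp, and slurp_hex is a hypothetical helper name. The :crlf layer is requested explicitly here so the behaviour is the same on any platform.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Write a sample with two CRLF line endings and one stray LF.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} "one\x0D\x0Atw\x0Ao\x0D\x0A";
close($tmp_fh) or die("Can't close \"$tmp_name\": $!\n");

# Hypothetical helper: read the whole file through the given layer
# and return its bytes as a hex string.
sub slurp_hex {
    my ($path, $layer) = @_;
    open(my $fh, "<$layer", $path)
        or die("Can't open \"$path\": $!\n");
    local $/;    # slurp mode
    my $data = <$fh>;
    return unpack("H*", $data);
}

my $with_crlf = slurp_hex($tmp_name, ":crlf");
my $with_raw  = slurp_hex($tmp_name, ":raw");

print "with :crlf: $with_crlf\n";    # every 0d0a has become 0a
print "with :raw:  $with_raw\n";     # bytes exactly as stored
```

Through :crlf the stray LF and the converted line endings are indistinguishable, which is why the preprocessing has to happen on the raw bytes.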

Simply removing/disabling the :crlf layer will do the trick.

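use feature qw( say );    # say needs Perl 5.10 or later

local $/ = "\x0D\x0A";    # read CRLF-terminated records
open(my $fh, "<:raw", $file)
    or die("Can't open \"$file\": $!\n");

while (<$fh>) {
    chomp;                # remove the trailing CRLF
    s/\x0A//g;            # remove any stray LF left inside the line
    say;
}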
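
Since the question asks for the cleaned lines to be saved to a new file, the same approach could be wrapped up roughly as follows. This is only a sketch: strip_stray_lf and its two path parameters are hypothetical names, and the output is written in :raw mode with explicit CRLF terminators so nothing is translated on the way out.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_path to $out_path, removing every LF
# that is not part of a CRLF pair.
sub strip_stray_lf {
    my ($in_path, $out_path) = @_;

    open(my $in, "<:raw", $in_path)
        or die("Can't open \"$in_path\": $!\n");
    open(my $out, ">:raw", $out_path)
        or die("Can't create \"$out_path\": $!\n");

    local $/ = "\x0D\x0A";    # records end at CRLF only
    while (my $line = <$in>) {
        # chomp returns the number of characters it removed, so a final
        # line with no CRLF terminator is written back without one.
        my $had_crlf = chomp($line);
        $line =~ s/\x0A//g;   # drop stray LFs inside the record
        print {$out} $line, ($had_crlf ? "\x0D\x0A" : "");
    }

    close($out) or die("Can't close \"$out_path\": $!\n");
    return;
}
```

It would be called as strip_stray_lf($infile, $outfile), with both names supplied by the caller.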

Upvotes: 2
