Reputation: 56
I have been receiving a text file where each row should be 246 columns in length. For some reason an errant CRLF is being inserted in the file after every 23,036 characters, causing all sorts of problems.
The file is in Windows format; all line endings are CRLF.
Is there some way to strip out these extra CRLF characters from this file, without disturbing the CRLF that exists at the end of every other line? Unix tools would be the preferred method here, if possible (awk, sed, etc).
Below is a sample of what the block of text looks like when an extra CRLF is added. Please note that this file is 258 MB in size, and the extra CRLF falls at different positions within the line further down the file.
Upvotes: 2
Views: 2115
Reputation: 247012
With awk:
awk '
length($0) != 247 {sub(/\r$/,""); printf "%s", $0; next}
{print}
' file
Note that "unix" text files have \n line endings, so \r is just a plain character. That's why I use 246+1 as the record length, and remove the CR from record fragments.
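As a quick check of the 246+1 reasoning, this small sketch feeds one CRLF-terminated line to awk on a Unix system and prints its length. The three-character sample line is made up for illustration.

```shell
# awk splits records on \n only, so the trailing \r stays in $0
# and is counted by length(): 3 data characters + 1 CR.
printf 'abc\r\n' | awk '{ print length($0) }'
```

This should print 4, which is why a complete 246-character row shows up as 247 in the awk test.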
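Update: yes, the above answer is incorrect. Instead of appending only the next line to a fragment, it appends the next TWO lines, because the remainder of the split record is also shorter than 247 characters and is therefore printed without a newline as well. Try this: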
awk '
length($0) != 247 {sub(/\r$/,""); printf "%s", $0; getline; print; next}
{print}
' file
When it detects a short line, remove the CR and print it with no newline. Then read the next line, which I assume is the rest of that record, and print it with the CR intact. Then move on to the next record.
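To see the behaviour described above on something small, here is a scaled-down sketch. The `rejoin` function name, its width argument and the 6-character sample records are all assumptions made for illustration. It also guards the `getline` against end of input, which the program above does not do.

```shell
# rejoin WIDTH: hypothetical wrapper around the awk program above.
# WIDTH is the record length without the CRLF (246 in the question).
rejoin() {
    awk -v w="$(( $1 + 1 ))" '
        length($0) != w {
            sub(/\r$/, "")          # drop the errant CR from the fragment
            printf "%s", $0         # print the fragment without a newline
            if ((getline) > 0)      # read the rest of the record, if there is one
                print               # and print it with its CR intact
            next
        }
        { print }                   # complete records pass through unchanged
    '
}

# Three 6-character records; the second one is split after "ghi".
printf 'abcdef\r\nghi\r\njkl\r\nmnopqr\r\n' | rejoin 6 | od -c
```

Piping through `od -c` makes the `\r` and `\n` bytes visible, so it is easy to confirm that "ghi" and "jkl" end up on one line again.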
Upvotes: 0
Reputation: 23870
Here's a simple perl script that runs a loop, where in every iteration, it copies 23036 bytes to the output and then skips the CRLF that follows.
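#!/usr/bin/perl
use strict;
use warnings;
# Treat both streams as raw bytes so CRLF is never translated.
binmode STDIN;
binmode STDOUT;
while (1) {
    # Copy the next 23036-byte chunk to the output.
    my $r = read STDIN, my $buf, 23036;
    defined $r or die "error: $!";
    print $buf;
    last if $r < 23036;    # short read: that was the final chunk
    # Skip the errant CRLF that follows the chunk.
    my $c = read STDIN, my $crlf, 2;
    defined $c or die "error: $!";
    last if $c == 0;       # input ended exactly on a chunk boundary
    $crlf eq "\r\n" or die "Not a CRLF";
}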
You run it like this:
./myscript.pl < input-file.txt > output-file.txt
Upvotes: 0
Reputation: 20022
When you're not sure at what position the extra CRLF occurs, you can delete all line endings and add them back at the right places:
(tr -d "\r\n" < my_inputfile | fold -w 245;echo) | sed 's/$/\r/'
The echo is needed, since fold will not add a newline for the last line.
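A scaled-down sketch of the same idea, with a check of the result. The `refold` and `count_bad` names and the 6-character sample records are assumptions for illustration. awk is swapped in for the sed step to append the CR, because `\r` in a sed replacement is not understood by every sed.

```shell
# refold WIDTH: hypothetical helper; WIDTH is the record length without CRLF.
refold() {
    { tr -d '\r\n' | fold -w "$1"; echo; } |
        awk '{ printf "%s\r\n", $0 }'   # re-append CRLF to every folded line
}

# count_bad WIDTH: print how many lines are not WIDTH characters plus one CR.
count_bad() {
    awk -v w="$(( $1 + 1 ))" 'length($0) != w { bad++ } END { print bad + 0 }'
}

# Three 6-character records; the second one is split after "ghi".
printf 'abcdef\r\nghi\r\njkl\r\nmnopqr\r\n' | refold 6 | count_bad 6
```

For the real file the width would be the number of characters per record excluding the CRLF; whether that is 245 or 246 depends on whether the stated 246 columns include the CR. If the records can contain tabs or multibyte characters, `fold -b` counts bytes instead of display columns.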
Upvotes: 1