z-vap

Reputation: 56

Remove CRLF from the middle of a file

I have been receiving a text file where each row should be 246 columns in length. For some reason an errant CRLF is being inserted in the file after every 23,036 characters, causing all sorts of problems.

The file is in Windows format; all line endings are CRLF.

Is there some way to strip these extra CRLF sequences from the file without disturbing the CRLF at the end of every legitimate line? Unix tools (awk, sed, etc.) would be preferred, if possible.

Below is a sample of what the text looks like when an extra CRLF has been added. Please note that the file is 258 MB in size, and that the extra CRLF lands at a different position within the row each time it occurs further down the file.

[Screenshot of the sample text]

Upvotes: 2

Views: 2115

Answers (3)

glenn jackman

Reputation: 247012

With awk:

awk '
    length($0) != 247 {sub(/\r$/,""); printf "%s", $0; next} 
    {print}
' file

Note that "unix" text files have \n line endings, so \r is just a plain character. That's why I use 246+1 as the record length, and remove the CR from record fragments.
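
To see why the record length is 247 rather than 246, here is a minimal sketch using a made-up 3-character row. On a Unix system awk splits records on \n only, so the CR is counted as part of the line:

```shell
# A 3-character row terminated by CRLF. awk removes only the \n,
# so $0 is "abc\r" and its length is 3 + 1.
len=$(printf 'abc\r\n' | awk '{ print length($0) }')
echo "$len"    # prints 4
```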


Update: yes, the version above is incorrect. Instead of appending ONLY the next line to a short fragment, it appends the next TWO lines. Try this instead:

awk '
    length($0) != 247 {sub(/\r$/,""); printf "%s", $0; getline; print; next} 
    {print}
' file

When it detects a short line, remove the CR and print it with no newline. Then read the next line, which I assume is the rest of that record, and print it with the CR intact. Then move on to the next record.
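
As a rough illustration of the steps above, here is the same logic run on a scaled-down sample. The data is hypothetical: rows are 5 characters wide instead of 246, so the expected length (passed in through an assumed variable named want) is 6, and the second row has been split by a stray CRLF. It assumes at most one stray CRLF per row, as in the question.

```shell
# Hypothetical sample: three 5-character rows, the second one broken in two.
# The trailing tr only strips the CRs so the result is easy to read here.
fixed=$(printf 'AAAAA\r\nBB\r\nBBB\r\nCCCCC\r\n' |
  awk -v want=6 '
    # Short fragment: drop its CR, print it without a newline,
    # then read the remainder of the row and print that as-is.
    length($0) != want { sub(/\r$/, ""); printf "%s", $0; getline; print; next }
    { print }
  ' | tr -d '\r')
printf '%s\n' "$fixed"
```

The two fragments BB and BBB come out rejoined as a single BBBBB row between the other two.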

Upvotes: 0

redneb

Reputation: 23870

Here's a simple perl script that runs a loop, where in every iteration, it copies 23036 bytes to the output and then skips the CRLF that follows.

#!/usr/bin/perl
use strict;
use warnings;

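binmode STDIN;
binmode STDOUT;

while (1) {
    # Copy the next chunk; a short read means this is the final chunk.
    my $r=read STDIN,my $buf,23036;
    defined $r or die "error: $!";
    print $buf;
    last if $r<23036;
    # Consume the stray CRLF after a full chunk, or stop at end of input.
    my $c=read STDIN,my $crlf,2;
    defined $c or die "error: $!";
    last if $c==0;
    $crlf eq "\r\n" or die "Not a CRLF";
}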

You run it like this:

./myscript.pl < input-file.txt > output-file.txt
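
One way to sanity-check the result afterwards is to count the lines that do not have the expected width. This is only a sketch: count_bad_rows is a hypothetical helper, not part of the script above, and it assumes every good row is the given number of characters plus a trailing CR.

```shell
# count_bad_rows FILE WIDTH
# Prints how many lines in FILE are not WIDTH characters plus a trailing CR.
count_bad_rows() {
  awk -v want="$(( $2 + 1 ))" 'length($0) != want { bad++ } END { print bad + 0 }' "$1"
}

# Example use on the repaired file (a result of 0 means every row has the expected width):
# count_bad_rows output-file.txt 246
```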

Upvotes: 0

Walter A

Reputation: 20022

If you are not sure at which positions the extra CRLFs occur, you can delete all line endings and add them back in the right places:

(tr -d "\r\n" < my_inputfile | fold -w 246;echo) | sed 's/$/\r/'

The echo is needed, since fold will not add a newline for the last line.
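
A small sketch of that behaviour, using a made-up 6-character stream and a width of 3 so the effect is easy to count:

```shell
# fold breaks the stream after every 3 characters but adds no newline
# after the final piece, so only one newline is emitted for two rows.
without_echo=$(printf 'abcdef' | fold -w 3 | wc -l)
# With the extra echo, the last row is terminated as well.
with_echo=$( (printf 'abcdef' | fold -w 3; echo) | wc -l)
echo "newlines without echo:" $without_echo "with echo:" $with_echo
```

Without the echo, the sed step would be handed a final row that has no line terminator.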

Upvotes: 1
