Reputation: 19318

Delete newline character in text file if next line is less than a certain length

I'd like to create a script with any combination of bash, sed, awk, or perl that deletes the newline character of a line if the next line is less than a certain length. Let's say we want to delete the newline character if the next line is less than 5 characters. If we have this source text file:

hi hi hi hi hi
bye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pants
belt
paper paper paper

Here's the desired output:

hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

Here's a script that identifies all the lines that are less than 5 characters:

awk 'length($0) < 5 { print NR }' source.txt

It returns this:

2
7

Here's a script that gets rid of the newlines (it's the line numbers from the previous script minus one):

perl -pe 'chomp if $.==1||$.==6' source.txt

How do I combine these two scripts? Or is there a better way to solve this?

Update

There were multiple correct answers (some didn't work on my Mac, but I think they'd work on other machines). Here's how long the correct answers took on my machine with a 769,811 line CSV file (40,000 lines had the newline character removed).

Ed Morton's awk solution: 23.7 seconds
wolfrevokcats perl with slurp: 4.5 seconds
John1024's solution didn't work on my Mac (but I think it works on other OSs)
ikegami's perl without slurp: Killed the task after 7 minutes

Upvotes: 2

Answers (5)

John1024

Reputation: 113864

sed is also good for simple substitutions such as this:

$ sed -E ':a; N; s/\n(.{,4})$/\1/; ba' source
hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

How it works:

:a

This defines a label a.

N

This reads in the next line and appends it (preceded by a newline character) to the current contents of the pattern space.

s/\n(.{,4})$/\1/

If the last newline in the pattern space is followed by at most 4 characters, i.e. the line just read is shorter than 5 characters, then remove that newline.

ba

This jumps back to label a unconditionally. The loop ends when N finds no more input, at which point GNU sed prints the accumulated pattern space and exits.

BSD/macOS

The above was tested with GNU sed. For BSD/macOS sed, which may not accept the {,4} interval shorthand, try:

sed -E -e :a -e N -e 's/\n(.{0,4})$/\1/' -e ba source
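
Because ba branches unconditionally, the loop above keeps the whole file in the pattern space. A streaming variation can be sketched as follows, assuming the same 4-character limit: t branches only after a successful substitution, and P and D print and drop the finished line. This is an illustration rather than the command from the answer, and the sample helper is a hypothetical name.

```shell
# Hypothetical helper: emits the sample input from the question.
sample() {
  printf '%s\n' 'hi hi hi hi hi' 'bye' 'fun fun fun fun fun' 'batman' \
    'shirt shirt shirt' 'pants pants pants' 'belt' 'paper paper paper'
}

# $!N  append the next line unless we are already on the last one
# s    drop the newline when the appended line has at most 4 characters
# ta   after a successful join, loop to try the following line as well
# P;D  print the finished first line, delete it, restart with the remainder
result=$(sample | sed -E -e ':a' -e '$!N' -e 's/\n(.{0,4})$/\1/' -e 'ta' -e 'P' -e 'D')
printf '%s\n' "$result"
```

At most two input lines (plus any short lines joined onto them) are held in memory at a time.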

Upvotes: 0

ctac_

Reputation: 2471

You can try this sed command (tested OK on OpenBSD):

sed -e '$b' -e 'N;/\n...../{P;D' -e '};y/\n/ /;s/ \([^ ]*$\)/\1/' infile

Upvotes: 0

ikegami

Reputation: 385897

If you want to avoid slurping and you want to look ahead, the general solution is to buffer as many lines as you want to look ahead. One in this case.

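perl -ne'
   chomp;
   if (defined $buf) {
      print $buf;                  # emit the buffered previous line
      print "\n" if length >= 5;   # end it only if the current line is long enough
   }

   $buf = $_;                      # buffer the current line for the next iteration

   END { print "$buf\n" if defined $buf; }
'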

In this particular case, you can make do with the following:

perl -pe'chomp; print "\n" if length >= 5 && $. > 1; END { print "\n" if $. }'

Both of these solutions handle inputs that don't have a line feed on the last line.

See Specifying file to process to Perl one-liner for usage.

Upvotes: 1

Ed Morton

Reputation: 203674

As in life, in software it's much easier to do things based on what has happened rather than what will happen. Don't think of any problem as needing to do X if the NEXT line contains Y, think of it as needing to do Z if the CURRENT line contains Y and then the solution is always simple and obvious, e.g.:

$ cat tst.awk
NR>1{ printf "%s%s", prev, (length() < 5 ? "" : ORS) }
{ prev = $0 }
END{ print prev }

$ awk -f tst.awk file
hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

In the above we print a newline if the CURRENT line length is 5 or more. It's clear and simple and will work with any awk in any shell on any UNIX box.
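
The threshold in the script above is hard-coded. A minimal sketch that passes it in with -v instead, assuming a hypothetical shell wrapper join_short and awk variable min (neither name comes from the answer), might look like this:

```shell
# join_short MIN: join each line shorter than MIN characters onto the
# previous line. Hypothetical wrapper; MIN defaults to 5.
join_short() {
  awk -v min="${1:-5}" '
    # From the second line on, print the previous line, ending it with a
    # newline only when the current line is at least min characters long.
    NR > 1 { printf "%s%s", prev, (length($0) < min ? "" : ORS) }
    { prev = $0 }
    # Guard against empty input so no stray blank line is printed.
    END { if (NR > 0) print prev }
  '
}

# Hypothetical helper: emits the sample input from the question.
sample() {
  printf '%s\n' 'hi hi hi hi hi' 'bye' 'fun fun fun fun fun' 'batman' \
    'shirt shirt shirt' 'pants pants pants' 'belt' 'paper paper paper'
}

joined_5=$(sample | join_short 5)   # joins "bye" and "belt"
joined_7=$(sample | join_short 7)   # additionally joins "batman"
printf '%s\n' "$joined_5"
```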

Upvotes: 4

wolfrevokcats

Reputation: 2100

perl -p0777e "s{\r?\n(?=.{0,5}$)}{}mg" test.txt

output

hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

[ Well it took me 2 minutes to write the one-liner and about an hour to explain. ]

Here's the explanation:

Switches:

-p - read every line of the input files, run the code specified by -e for each line, and print the variable $_ (which is modified by the -e code)

-0[octal number] - input record separator; if we specify 0777 the whole file is treated as a single record and read at once

-l - strip the trailing \n from input lines and set the output line separator equal to the input line separator. (I removed it, because it's not actually needed here.)

Now the regular expression:

s{\r?\n(?=.{0,5}$)}{}mg

s{pattern}{replacement} - search for pattern in variable $_ and replace it with replacement

pattern parts:

\r?\n - match every newline symbol. For Unix \n would be enough, \r? - optional match of CR that may be necessary for old perl versions under Windows. Actually I think \r? can be removed too.

(?=pattern) - a positive look-ahead match of pattern, a zero width match, that is it does not consume the characters.

.{0,5}$ - match from zero to five characters followed by the end of a line

s{}{} operator modifiers: m - multiline matching, makes $ match just before every \n in the text, not only at the end of the string. g - global matching, replace every occurrence in the text.

Finally, how it all works:

Perl slurps the whole file into $_ (-0777 together with -p), then searches for every occurrence of \r?\n that is followed by no more than 5 non-newline characters and then the end of a line: (?=.{0,5}$).
Every occurrence is replaced by the empty string {}.

I think I've been clear enough.

Additional information can be obtained from: perldoc perlre, perldoc perlop, perldoc perlrun.

Upvotes: 2
