Digger

Reputation: 174

Append to line that is preceded AND followed by empty line

I need to append an asterisk to a line, but only if said line is preceded and followed by empty lines (FYI, said empty lines will NOT have any white space in them).

Suppose I have the following file:

foo

foo
foo

foo

foo

I want the output to look like this:

foo

foo
foo

foo*

foo

I tried modifying the following awk command (found here):

awk 'NR==1 {l=$0; next}
       /^$/ {gsub(/test/,"xxx", l)}
       {print l; l=$0}
       END {print l}' file

to suit my uses, but got all tied up in knots.

Sed or Perl solutions are, of course, welcome also!

UPDATE:

It turned out that the question I asked was not quite correct. What I really needed was code that would append text to non-empty lines that do not start with whitespace AND are followed, two lines down, by non-empty lines that also do not start with whitespace.

For this revised problem, suppose I have the following file:

foo

third line foo

fifth line foo
 this line starts with a space foo
 this line starts with a space foo

ninth line foo

eleventh line foo

 this line starts with a space foo

last line foo

I want the output to look like this:

foobar

third line foobar

fifth line foo
 this line starts with a space foo
 this line starts with a space foo

ninth line foobar

eleventh line foo

 this line starts with a space foo

last line foo

For that, this sed one-liner does the trick:

sed '1N;N;/^[^[:space:]]/s/^\([^[:space:]].*\)\(\n\n[^[:space:]].*\)$/\1bar\2/;P;D' infile

Thanks to Benjamin W.'s clear and informative answer below, I was able to cobble this one-liner together!
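For anyone who finds the sed one-liner above dense, here is a minimal Python sketch of the same revised rule. The function name `append_bar` and the `suffix` parameter are made up for illustration, and, like the `\n\n` in the sed pattern, the sketch assumes the line in between must be empty.

```python
def append_bar(lines, suffix="bar"):
    """Append suffix to each line that starts with a non-whitespace character,
    is followed by an empty line, and then by another line that also starts
    with a non-whitespace character."""
    def starts_solid(s):
        # Non-empty and the first character is not whitespace
        return bool(s) and not s[0].isspace()

    out = []
    for i, line in enumerate(lines):
        if (starts_solid(line)
                and i + 2 < len(lines)
                and lines[i + 1] == ""
                and starts_solid(lines[i + 2])):
            line += suffix
        out.append(line)
    return out
```

Unlike the sed version, this holds the whole input in memory, which is fine for small files but is not a streaming filter.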

Upvotes: 1

Views: 513

Answers (7)

zdim

Reputation: 66883

A perl one-liner

perl -0777 -lne's/(?<=\n\n)(.*?)(\n\n)/$1\*$2/g; print' ol.txt

The -0777 "slurps" in the whole file, assigned to $_, on which the (global) substitution is run and which is then printed.

The lookbehind (?<=text) is needed for repeating patterns, [empty][line][empty][line][empty]. It is a "zero-width assertion" that only checks that the pattern is there without consuming it. That way the pattern stays available for next matches.

Such consecutive repeating patterns trip up the substitution s/(\n\n)(.*?)(\n\n)/$1$2\*$3/g, posted initially, since the trailing \n\n has just been matched (consumed) and so cannot serve as the start of the very next match.
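The difference between consuming the leading blank line and merely asserting it can be seen in a small Python sketch using the `re` module, whose lookbehind syntax is the same as Perl's here. The sample string `text` is made up for illustration.

```python
import re

# Three candidate lines in a row: b and c are each surrounded by empty lines
text = "a\n\nb\n\nc\n\nd\n"

# Consuming version: the "\n\n" after b is eaten by the first match,
# so it is no longer available as the leading "\n\n" for c
consuming = re.sub(r"(\n\n)(.*?)(\n\n)", r"\1\2*\3", text)

# Lookbehind version: the leading "\n\n" is only asserted, not consumed,
# so back-to-back candidates are all found
lookbehind = re.sub(r"(?<=\n\n)(.*?)(\n\n)", r"\1*\2", text)

print(repr(consuming))   # 'a\n\nb*\n\nc\n\nd\n'
print(repr(lookbehind))  # 'a\n\nb*\n\nc*\n\nd\n'
```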

Upvotes: 1

John Hascall

Reputation: 9416

One way to think about these problems is as a state machine.

start: state = 0

0: /* looking for a blank line */
   if (blank line) state = 1

1: /* leading blank line(s) */
   if (not blank line) {
       nonblank = line
       state = 2
   }

2: /* saw non-blank line */
   if (blank line) {
       output nonblank*
       state = 1
   } else {
       state = 0
   }

And we can translate this pretty directly to an awk program:

BEGIN {
        state = 0;                # start in state 0
}

state == 0 {                      # looking for a (leading) blank line
        print;
        if (length($0) == 0) {    #   found one
                state = 1;
                next;
        }
}

state == 1 {                      # have a leading blank line
        if (length($0) > 0) {     #   found a non-blank line
                saved = $0;       #     save it
                state = 2;
                next;
        } else {
                print;            # multiple leading blank lines (ok)
        }
}

state == 2 {                      # saw the non-blank line
        if (length($0) == 0) {    #   followed by a blank line
                print saved "*";  #     BINGO!
                state = 1;        # to the saw a blank-line state
        } else {                  # nope, consecutive non-blank lines
                print saved;      #   as-is
                state = 0;        # to the looking for a blank line state
        }
        print;
        next;
}

END {                             # cleanup, might have something saved to show
        if (state == 2) print saved;
}

This is not the shortest way, nor likely the fastest, but it's probably the most straightforward and easy to understand.

EDIT

Here is a comparison of Ed's way and mine (see the comments under his answer for context). I replicated the OP's input a million-fold and then timed the runs:

# ls -l
total 22472
-rw-r--r--. 1 root root      111 Mar 13 18:16 ed.awk
-rw-r--r--. 1 root root 23000000 Mar 13 18:14 huge.in
-rw-r--r--. 1 root root      357 Mar 13 18:16 john.awk

# time awk -f john.awk < huge.in > /dev/null
2.934u 0.001s 0:02.95 99.3%     0+0k 112+0io 1pf+0w

# time awk -f ed.awk huge.in huge.in > /dev/null
14.217u 0.426s 0:14.65 99.8%    0+0k 272+0io 2pf+0w

His version took about 5 times as long, did twice as much I/O, and (not shown in this output) took 1400 times as much memory.

EDIT from Ed Morton: For those of us unfamiliar with the output of whatever time command John used above, here's the 3rd-invocation results from the normal UNIX time program on cygwin/bash using GNU awk 4.1.3:

$ wc -l huge.in
1000000 huge.in

$ time awk -f john.awk huge.in > /dev/null
real    0m1.264s
user    0m1.232s
sys     0m0.030s

$ time awk -f ed.awk huge.in huge.in > /dev/null
real    0m1.638s
user    0m1.575s
sys     0m0.030s

so if you'd rather write 37 lines than 3 lines to save a third of a second on processing a million line file then John's answer is the right one for you.

EDIT#3

It's the standard "time" built-in from tcsh/csh. And even if you didn't recognize it, the output should be intuitively obvious. And yes, boys and girls, my solution can also be written as a short incomprehensible mess:

s == 0 { print; if (length($0) == 0) { s = 1; next; } }
s == 1 { if (length($0) > 0) { p = $0; s = 2; next; } else { print; } }
s == 2 { if (length($0) == 0) { print p "*"; s = 1; } else { print p; s = 0; } print; next; }
END { if (s == 2) print p; }

Upvotes: 2

Benjamin W.

Reputation: 52122

A sed solution:

$ sed '1N;N;s/^\(\n.*\)\(\n\)$/\1*\2/;P;D' infile
foo

foo
foo

foo*

foo

N;P;D is the idiomatic way to look at two lines at the same time by appending the next one to the pattern space, then printing and deleting the first line.

1N;N;P;D extends that to always having three lines in the pattern space, which is what we want here.

The substitution matches if the first and last line are empty (^\n and \n$) and appends one * to the line between the empty lines.

Notice that this matches and appends a * also for the second line of three empty lines, which might not be what you want. To make sure this doesn't happen, the first capture group has to have at least one non-whitespace character:

sed '1N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
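The three-line window that `1N;N;P;D` maintains can also be sketched outside sed. Below is a minimal Python version built on `collections.deque`; the function name `star_isolated` is hypothetical, and it applies the stricter rule of the second command (the middle line must start with a non-whitespace character).

```python
from collections import deque

def star_isolated(lines):
    """Yield lines, appending '*' to a line that sits between two empty lines."""
    window = deque()
    for line in lines:
        window.append(line)
        if len(window) < 3:
            continue  # like 1N;N: fill the window before checking anything
        first, middle, last = window
        if first == "" and last == "" and middle and not middle[0].isspace():
            window[1] = middle + "*"
        yield window.popleft()  # like P;D: emit and drop the oldest line
    yield from window  # flush whatever is left at end of input

sample = ["foo", "", "foo", "foo", "", "foo", "", "foo"]
print("\n".join(star_isolated(sample)))
```

Because the oldest line is emitted only after the check, consecutive matches (empty, line, empty, line, empty) are all handled.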

Question from comment

Can we not append the * if the line two above begins with abc?

Example input file:

foo

foo
abc

foo

foo

foo

foo

There are three foo between empty lines, but the first one should not get the * appended because the line two above starts with abc. This can be done as follows:

$ sed '1{N;N};N;/^abc/!s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
foo

foo
abc

foo

foo*

foo*

foo

This keeps four lines at a time in the pattern space and only makes the substitution if the pattern space does not start with abc:

1 {      # On the first line
    N    # Append next line to pattern space
    N    # ... again, so there are three lines in pattern space
}
N        # Append fourth line
/^abc/!  # If the pattern space does not start with abc...
    s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/   # Append '*' to 3rd line in pattern space
P        # Print first line of pattern space
D        # Delete first line of pattern space, start next cycle

Two remarks:

  1. BSD sed requires an extra semicolon: 1{N;N;} instead of 1{N;N}.
  2. If the first and third line of the file are empty, the second line does not get an asterisk appended because we only start checking once there are four lines in the pattern space. This could be solved by adding an extra substitution into the 1{} block:

    1{N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/}
    

    (remember the extra ; for BSD sed), but trying to cover all edge cases makes sed even less readable, especially in one-liners:

    sed '1{N;N;s/^\(\n[^[:space:]].*\)\(\n\)$/\1*\2/};N;/^abc/!s/^\(.*\n\n[^[:space:]].*\)\(\n\)$/\1*\2/;P;D' infile
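For comparison, the "not if the line two above starts with abc" rule reads more plainly when written against line indexes. This is only a sketch in Python, not a replacement for the sed command: the name `star_unless_abc` and the `banned` parameter are hypothetical, and it keeps the whole input in memory instead of a four-line window, which is also why the first-lines edge case from remark 2 needs no special handling.

```python
def star_unless_abc(lines, banned="abc"):
    """Append '*' to a non-empty line between two empty lines, unless the
    line two above it starts with `banned`."""
    lines = list(lines)
    out = lines[:]  # copy, so every check looks at the unmodified input
    for i in range(1, len(lines) - 1):
        middle = lines[i]
        isolated = (lines[i - 1] == "" and lines[i + 1] == ""
                    and middle != "" and not middle[0].isspace())
        # There is no "line two above" for the second line of the file
        blocked = i >= 2 and lines[i - 2].startswith(banned)
        if isolated and not blocked:
            out[i] = middle + "*"
    return out
```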
    

Upvotes: 5

karakfa

Reputation: 67467

An alternative awk solution (single pass):

$ awk 'NR>2 && !pp && !NF {p=p"*"}
       NR>1 {print p}
            {pp=length(p); p=$0}
       END  {print p}' foo
foo

foo
foo

foo*

foo

Explanation: printing is deferred by one line so that the decision can be made once the following line is known. The previous line is kept in p, and pp holds the length of the line before that (zero means it was empty). After the bookkeeping assignments, the END block prints the last line.
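The same deferred-print idea, sketched in Python as a generator. The names `star_deferred`, `prev` and `prev_prev_empty` are made up for illustration (they play the roles of p and !pp), and, as an assumption that differs from the awk script, the sketch also requires the held-back line to be non-empty, so the middle of three empty lines is left alone.

```python
def star_deferred(lines):
    """One pass: hold each line back until the following line has been seen."""
    prev = None              # line held back, not yet emitted (p in awk)
    prev_prev_empty = False  # was the line before prev empty? (!pp in awk)
    for n, line in enumerate(lines, start=1):
        # Current line empty, line two back empty, held-back line non-empty
        if n > 2 and prev_prev_empty and prev != "" and line == "":
            prev += "*"
        if n > 1:
            yield prev
        prev_prev_empty = (prev == "")
        prev = line
    if prev is not None:
        yield prev  # the last line can never be followed by an empty line
```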

Upvotes: 0

Ed Morton

Reputation: 203324

It's simplest and clearest to do this in 2 passes:

$ cat tst.awk
NR==FNR { nf[NR]=NF; nr=NR; next }
FNR>1 && FNR<nr && NF && !nf[FNR-1] && !nf[FNR+1] { $0 = $0 "*" }
{ print }

$ awk -f tst.awk file file
foo

foo
foo

foo*

foo

The above takes one pass to record the number of fields on each line (NF is zero for an empty line) and then the second pass just checks your requirements - the current line is not the first or last in the file, it is not empty and the lines before and after are empty.
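As a rough illustration of the two-pass structure outside awk, here is a Python sketch that reads a seekable text stream twice. The function name `star_two_pass` is hypothetical, and `io.StringIO` stands in for the file that the awk command names twice on the command line.

```python
import io

def star_two_pass(fh):
    """fh is a seekable text stream; it is read twice, like `file file` in awk."""
    # Pass 1: record whether each line has any fields (awk's NF != 0)
    has_fields = [bool(line.split()) for line in fh]
    last = len(has_fields) - 1

    # Pass 2: re-read the input and decide using the recorded neighbours
    fh.seek(0)
    out = []
    for i, line in enumerate(fh):
        text = line.rstrip("\n")
        if (0 < i < last and has_fields[i]
                and not has_fields[i - 1] and not has_fields[i + 1]):
            text += "*"
        out.append(text)
    return out

sample = io.StringIO("foo\n\nfoo\nfoo\n\nfoo\n\nfoo\n")
result = star_two_pass(sample)
print("\n".join(result))
```

The trade-off is the one discussed in the other answers: the logic stays short, but one flag per input line is kept in memory.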

Upvotes: 0

Jorgen

Reputation: 195

Update: My solution also fails after two consecutive matches, as described above, and needs the same lookbehind: s/(?<=\n\n)(\w+)\n\n/$1*\n\n/mg;

The easiest way is to use multi-line match:

    local $/;     ## slurp mode
    $file = <DATA>;

    $file =~ s/\n\n(\w+)\n\n/\n\n$1*\n\n/mg;
    print $file;  ## print, not printf: the data is not a format string

    __DATA__
    foo

    foo
    foo

    foo

    foo

Upvotes: 0

hobbs

Reputation: 239841

Here's a Perl filter version, for the sake of illustration; hopefully it's clear how it works. It would be possible to write a version with a lower input-output delay (2 lines instead of 3), but I don't think that's important.

my @lines;

while (<>) {
    # Keep three lines in the buffer, print them as they fall out
    push @lines, $_;
    print shift @lines if @lines > 3;

    # If a non-empty line occurs between two empty lines...
    if (@lines == 3 && $lines[0] =~ /^$/ && $lines[2] =~ /^$/ && $lines[1] !~ /^$/) {
        # place an asterisk at the end
        $lines[1] =~ s/$/*/;
    }
}

# Flush the buffer at EOF
print @lines;

Upvotes: 1
