starfry

Reputation: 9943

How, using sed, can one extract a regex-delimited range except for the last line?

A simple sed expression to extract a block of lines delimited by regular expressions from a text file looks like this:

$ sed -n -e '/start-regex/,/end-regex/ p' input_file

This selects lines from and including the line matching start-regex up to and including the line matching end-regex.
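As a quick check of the inclusive behaviour, here is a minimal sketch on made-up input. The sample text and the literal patterns START and END are assumptions standing in for start-regex and end-regex; they do not come from any real file.

```shell
# Hypothetical sample input; START and END stand in for start-regex and end-regex.
input='before
START
middle
END
after'

# Print the range, including both delimiter lines.
inclusive=$(printf '%s\n' "$input" | sed -n -e '/START/,/END/ p')
printf '%s\n' "$inclusive"
```

Both delimiter lines appear in the output, along with everything between them.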

The line matching end-regex may be excluded like this:

$ sed -n -e '/start-regex/,/end-regex/ {/end-regex/d;p}' input_file

Is it possible to do this without repeating end-regex?

If it's possible to omit the last line, does it follow that it's also possible to omit the first line, the last line, or both, without repeating the regexes?

The reason for this question is to find a more efficient way of solving the problem than repeating expressions which can be complex and hard to read.

This question is about sed, and a single instance thereof, specifically. There may be ways to do this with pipelines of head, tail, awk, etc, but the question asks if this is possible using sed only.

There are a number of similar questions but they ask for solutions to specific use-cases rather than dealing with the generic problem at source.

Any solution should work with GNU sed.

Upvotes: 3

Views: 927

Answers (3)

stevesliva

Reputation: 5655

The second example below is a sed-only answer that pads the output with blank lines. The third example gives exactly what has been asked for, provided you can choose a pattern that's never in the range that should be kept.

If, within your input file, the range matches only once, this works. It prints what you want starting with a blank line.

sed -n -e '/start-regex/,/end-regex/{x;p}' input-file

For each line in the range, x exchanges the line in the pattern space with the line in the hold space, and p prints the line pulled from the hold space. The effect is that each line in the range prints the line that preceded it.
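To make the one-line delay concrete, here is a small sketch on made-up input. The sample text and the literal patterns START and END are assumptions; the trailing semicolon before the closing brace is added only so the command should also run under BSD sed.

```shell
# Hypothetical sample input with a single START..END range.
input='before
START
middle
END
after'

# On each line in the range, swap pattern space and hold space, then print.
# The hold space starts out empty, so the first line printed is blank, and
# the END line is left sitting in the hold space without ever being printed.
delayed=$(printf '%s\n' "$input" | sed -n -e '/START/,/END/{x;p;}')
printf '%s\n' "$delayed"
```

The output is a blank line followed by START and middle; END never appears because nothing comes along to swap it back out of the hold space.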

But, as said, that only works if the range occurs once. If the range occurs more than once, the line matching end-regex is still in the hold space when the next range begins, so it gets printed there.

So instead, the script below empties out the lines outside the range, stuffs that empty line in the hold space with h, and then runs the x;p which will print a blank line for start-regex and nothing for end-regex:

sed -n -e '/start-regex/,/end-regex/! {s/.//g;h;};x;p' input-file

The above is the most general solution I can give. It retains blank lines within the range, but it is not perfect because it inserts blank lines before the range:


start-regex line 1
  next line is blank...

  etc1

start-regex line 2
  etc2

To delete blank lines, you can change the final p to /^$/! p, but that will omit blank lines within the input-file range as well as the padding lines added before each range by the script. If you really can't stomach the added blank lines, you could always stick in a placeholder on the non-matching lines:

sed -n -e '/start-regex/,/end-regex/! {s/.*/OMITME/;h;};x;/OMITME/! p' input-file

And that still depends on OMITME not being a pattern in the range you want to keep. But you get the desired result:

start-regex line 1
  next line is blank...

  etc1
start-regex line 2
  etc2

Upvotes: 0

Jonathan Leffler

Reputation: 753695

BSD and GNU sed both agree that you can omit both the first and the last line in the range without repeating either regex, but it is a tad quirky.

sed -n -e '/first-regex/,/second-pattern/ { //!p; }'

(BSD sed requires the semicolon; GNU sed doesn't mind whether it is there or not.)

The empty regex // matches the last regular expression that matched, and in this context, that is either the first pattern (at the beginning of the range) or the second pattern (at the end of the range). Note that the ranges should be disjoint if there is more than one such range.

Given an input file called data (I happened to have this around from playing with another question):

0x0  = 0
0x1  = 1
0x2  = 2
0x3  = 3
0x4  = 4
0x5  = 5
0x6  = 6
0x7  = 7
0x8  = 8
0x9  = 9
0xA  = 0
0xB  = 11
0xC  = 12
0xD  = 13
0xE  = 14
0xF  = 15

you can run:

$ sed -n -e '/0x4/,/0xC/ { //!p; }' data
0x5  = 5
0x6  = 6
0x7  = 7
0x8  = 8
0x9  = 9
0xA  = 0
0xB  = 11
$
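The same idea can be checked against input containing two disjoint ranges. This is only a sketch on made-up data: the sample text and the literal patterns START and END are assumptions, not part of the data file above.

```shell
# Hypothetical sample input with two disjoint START..END ranges.
input='a
START
b
END
c
START
d
END
e'

# Inside the range, the empty regex // reuses whichever range address was
# tested last: /START/ on the opening line, /END/ on every later line.
# Negating it with ! therefore prints only the lines strictly between them.
inner=$(printf '%s\n' "$input" | sed -n -e '/START/,/END/ { //!p; }')
printf '%s\n' "$inner"
```

Only b and d are printed, so both delimiter lines are dropped from each range.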

I've not yet found a way to omit one of the two patterns (the start or the end pattern) rather than both. My suspicion is that it cannot be done in sed without repeating one or the other regex.

Upvotes: 1

Ed Morton

Reputation: 203493

Never use ranges for exactly this reason (they need a rewrite or duplicate conditions given the slightest requirements change). Use a flag instead:

awk '/start/{f=1} /end/{f=0} f' file

That means you cannot do this in any concise, portable way with sed. There MAY be some bizarre combination of single-character runes that will do what you want in GNU sed, but if you think repeating the condition is complex and hard to read, wait until you see that! You need a tool like awk that supports variables. With the above approach you can print anything from all to none of the delimiters just by rearranging the three parts of the script (I added the {print} just for clarity, rather than relying on the default behavior):

$ seq 1 10 | awk '/3/{f=1} f{print} /7/{f=0}'
3
4
5
6
7

$ seq 1 10 | awk 'f{print} /3/{f=1} /7/{f=0}'
4
5
6
7

$ seq 1 10 | awk '/3/{f=1} /7/{f=0} f{print}'
3
4
5
6

$ seq 1 10 | awk '/7/{f=0} f{print} /3/{f=1}'
4
5
6

Upvotes: 3
