Klausi

Reputation: 11

Remove two lines using sed

I'm writing a script that parses an HTML document, and I would like to remove two consecutive lines. How does sed handle newlines? I tried:

sed 's/<!DOCTYPE.*\n<h1.*/<newstring>/g'

which didn't work. I also tried the following, but it removes the whole document, seemingly because it removes all newlines:

sed ':a;N;$!ba;s/<!DOCTYPE.*\n<h1.*\n<b.*/<newstring>/g'

Any ideas? Maybe I should work with awk?

Upvotes: 0

Views: 405

Answers (4)

Qualia

Reputation: 729

For the simple task of removing two consecutive lines when each matches some pattern, all you need is:

sed '/<!DOCTYPE.*/{N;/\n<h1.*/d}'

This uses an address that matches the first line you want to delete. When the address matches, sed executes:

  • N (next) - append the next input line to the current pattern space, separated by \n

It then tests a second address against the contents of the second line (the part following \n). If that matches, it executes:

  • d (delete) - discard the pattern space and start the next cycle with the next unread line

If d isn't executed, both lines are printed by default and processing continues as normal.
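To see both outcomes, here is a minimal sketch, assuming GNU sed. The sample lines and the remove_pair function name are made up for illustration and are not from the question. I wrote d;} instead of d} only because some non-GNU seds want a separator before the closing brace.

```shell
# Hypothetical helper wrapping the command from this answer.
remove_pair() {
  sed '/<!DOCTYPE.*/{N;/\n<h1.*/d;}'
}

# Case 1: the DOCTYPE line is directly followed by an <h1> line, so both go.
matched=$(printf '%s\n' '<!DOCTYPE html>' '<h1>Title</h1>' '<p>keep</p>' | remove_pair)

# Case 2: the next line is not an <h1>, so d never runs and nothing is removed.
unmatched=$(printf '%s\n' '<!DOCTYPE html>' '<p>intro</p>' '<h1>Title</h1>' | remove_pair)

printf '%s\n' "$matched"
printf '%s\n' "$unmatched"
```

The first call should print only the paragraph line; the second should print all three input lines unchanged.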

To adjust this for three lines, you need only use N again. If you want to pull in multiple lines until some delimiter is reached, you can use a line-pump, which looks something like this:

/<!DOCTYPE.*/{
    :pump
    N
    /some-regex-to-stop-pump/!b pump
    /regex-which-indicates-we-should-delete/d
}
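As a concrete instance of that skeleton, here is a sketch assuming GNU sed, where the stop regex is a closing </h1> tag and the delete regex is an opening <h1. Both regexes, the sample input, and the pump_delete name are my own placeholders, so swap in whatever fits your document.

```shell
# Hypothetical helper: pull in lines from <!DOCTYPE until a line containing
# </h1> has been appended, then delete the whole block if it contains <h1.
pump_delete() {
  sed '/<!DOCTYPE/{
    :pump
    N
    /<\/h1>/!b pump
    /<h1/d
  }'
}

pumped=$(printf '%s\n' '<!DOCTYPE html>' '<html>' '<h1>Title</h1>' '<p>keep</p>' | pump_delete)
printf '%s\n' "$pumped"
```

Here the pump runs N twice before the stop regex matches, and the collected three-line block is then deleted. If the stop regex never matches, GNU sed prints whatever was collected when N hits the end of input.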

However, writing a full HTML or XML parser in sed or awk is a Herculean task, and you're likely better off using an existing parser.

Upvotes: 2

potong

Reputation: 58371

This might work for you (GNU sed):

sed 'N;/<!DOCTYPE.*\n<h1.*/d;P;D' file

Append the following line and, if the pattern matches both lines in the pattern space, delete them.

Otherwise, print and then delete the first of the two lines, and repeat with the second line still in the pattern space.

To replace the two lines with another string, use:

sed 'N;s/<!DOCTYPE.*\n<h1.*/another string/;P;D'
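The P;D pair is what turns this into a sliding two-line window. A small sketch, assuming GNU sed and a made-up four-line input where the pair sits on lines 2 and 3 (the variable names are only for illustration):

```shell
# Hypothetical sample: the DOCTYPE/h1 pair starts on an even line number.
input=$(printf '%s\n' '<p>a</p>' '<!DOCTYPE html>' '<h1>T</h1>' '<p>b</p>')

# Without P;D, N pairs up lines 1+2 and 3+4, so the pair on 2+3 is never seen.
no_window=$(printf '%s\n' "$input" | sed 'N;/<!DOCTYPE.*\n<h1.*/d')

# With P;D, the second line of each pair is carried into the next cycle.
deleted=$(printf '%s\n' "$input" | sed 'N;/<!DOCTYPE.*\n<h1.*/d;P;D')

# Same window, but substituting the pair instead of deleting it.
replaced=$(printf '%s\n' "$input" | sed 'N;s/<!DOCTYPE.*\n<h1.*/another string/;P;D')

printf '%s\n' "$deleted"
printf '%s\n' "$replaced"
```

With this input, no_window should come back unchanged, deleted should keep only the two paragraph lines, and replaced should have another string between them.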

Upvotes: 0

Klausi

Reputation: 11

My solution for a document like this:

<b>...
<first...
<second...
<third...
<a ...

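This awk command works well. It treats the three consecutive lines as the record separator, so they drop out of the output (a regular-expression RS needs an awk that supports it, such as GNU awk):

awk -v RS='<first[^\n]*\n<second[^\n]*\n<third[^\n]*\n' '{printf "%s", $0}'

That's all.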
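For anyone who wants to try it, a quick sketch, assuming an awk that accepts a regular expression as RS (GNU awk does; POSIX leaves multi-character RS unspecified). The sample lines and the strip_block name are invented for the demo.

```shell
# Hypothetical helper around the command above. The three lines matched by RS
# act as the separator between records, so they never reach the output.
strip_block() {
  awk -v RS='<first[^\n]*\n<second[^\n]*\n<third[^\n]*\n' '{printf "%s", $0}'
}

stripped=$(printf '%s\n' '<b>x</b>' '<first a>' '<second b>' '<third c>' '<a href>' | strip_block)
printf '%s\n' "$stripped"
```

Only the first and last sample lines should remain. Because printf "%s" adds nothing back, the newlines that were inside each record are preserved exactly.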

Upvotes: 0

Raman Sailopal

Reputation: 12867

If an XML parsing tool is definitely not an option, awk may be an option:

awk '/<!DOCTYPE/ { lne=NR+1;next } NR==lne && /<h1/ { next }1' file

When a line contains "<!DOCTYPE", set the variable lne to the next line number (NR+1) and skip to the next line. Then, when the current line number equals lne (NR==lne) and the line contains "<h1", skip that line too. All other lines are printed by the trailing 1.
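One thing to be aware of: the one-liner skips the "<!DOCTYPE" line unconditionally, even when the following line has no "<h1". If you only want to drop the pair when both lines match, a possible variant (my own sketch, not the command above; drop_pair, held and have are illustrative names) holds the first line back until the next one has been checked:

```shell
# Hypothetical variant: buffer the DOCTYPE line and release it unless the
# very next line contains <h1.
drop_pair() {
  awk '
    have {                       # a DOCTYPE line is held from the previous record
      have = 0
      if ($0 ~ /<h1/) next       # pair matched: drop the held line and this one
      print held                 # no match: release the held line first
    }
    /<!DOCTYPE/ { held = $0; have = 1; next }
    { print }
    END { if (have) print held } # DOCTYPE was the last line of the input
  '
}

pair=$(printf '%s\n' '<!DOCTYPE html>' '<h1>Title</h1>' '<p>keep</p>' | drop_pair)
nopair=$(printf '%s\n' '<!DOCTYPE html>' '<p>intro</p>' '<h1>Title</h1>' | drop_pair)
printf '%s\n' "$pair"
printf '%s\n' "$nopair"
```

The first input loses both lines of the pair; the second comes through untouched, including its "<!DOCTYPE" line.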

Upvotes: 0
