Sed Regular Expression affecting content after the Regex

Question

I have an HTML file containing the following text:

Testtest

And I run this sed command against it:

sed -i -e "s:::g" /tmp/test/index.html

I'd expect that to replace just the matched tag and leave the rest alone, but it ends up affecting content after the regex:

 Testtest

It ended up removing the entire meta tag found after the regex. Am I just not doing the regex right?

GNU sed version 4.2.1

Benjamin W. · Accepted Answer

Because * is greedy, the .* in =\s*\".*\"\s*> matches as much as it can: it runs to the last "> on the line rather than stopping at the first one.

You can use single quotes around your command so you don't have to use \" for double quotes. Then, instead of ".*", you can use "[^"]*", which only matches to the next double quote.

This would make your command into

sed 's:::g'
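
To see the difference between the two patterns, here is a minimal sketch. The one-line input and the simplified patterns are made up for illustration and are not the exact file or command from the question; the variable names are hypothetical too.

```shell
# Hypothetical input: a base tag followed by a meta tag on the same line
line='<base href="/"><meta name="viewport" content="x">'

# Greedy: .* runs to the LAST "> on the line, so the meta tag is swallowed
greedy=$(printf '%s\n' "$line" | sed 's:<base href=".*">:<base href="/apps/test/">:')

# Bounded: [^"]* stops at the next double quote, so only the base tag matches
bounded=$(printf '%s\n' "$line" | sed 's:<base href="[^"]*">:<base href="/apps/test/">:')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The greedy version leaves only the rewritten base tag, while the bounded version keeps the meta tag that follows it.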

However, manipulating HTML with sed and regexes is eternally brittle and will break at the first possible opportunity. You could use an XML/HTML parser such as xmllint (see Roman's answer); an alternative is the W3C HTML-XML-utils with their hxpipe and hxunpipe commands.

These commands parse your HTML and turn it into a format easily processed with sed, awk & friends, then turn it back into HTML:

$ hxpipe infile.html
!html "" 
(html
(head
Acharset CDATA utf-8
(meta
(title
-Test
)title
Ahref CDATA /
(base
Aname CDATA viewport
Acontent CDATA width=device-width,initial-scale=1
(meta
)head
(body
-test
)body
)html
-\n

So, to turn the / in the href of the base tag into /apps/test/, we could do this:

$ hxpipe infile.html \
    | sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) [^\n]*|\1 /apps/test/|}' \
    | hxunpipe
Testtest

where the sed command

sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) [^\n]*|\1 /apps/test/|}'

or, more readably,

/Ahref CDATA/ {                                    # If line matches this
    N                                              # Append next line
    /\n(base$/ s|\(CDATA\) [^\n]*|\1 /apps/test/|  # If in base tag, replace href
}

in a more or less robust fashion makes your change.
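
The sed script can be tried without html-xml-utils installed by feeding it a few lines in hxpipe's format. This is only a sketch: the sample lines are modelled on the listing above, the second href (on an a tag) is made up to show a non-match, and the variable names are hypothetical. It uses [^\n]* for the old value because, in GNU sed, . also matches the newline that N puts into the pattern space, so .* would run past the href line and take the (base line with it.

```shell
# Sample lines in hxpipe format; the "(a" entry is a made-up non-base tag
input='Ahref CDATA /
(base
Ahref CDATA /other
(a'

# For each href attribute line, pull in the next line (N) and rewrite the
# value only if that next line opens a base tag. [^\n]* stops at the newline.
result=$(printf '%s\n' "$input" \
    | sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) [^\n]*|\1 /apps/test/|}')

printf '%s\n' "$result"
```

Only the href that belongs to the base tag changes; the one attached to the a tag passes through untouched. With real input, the printf would be replaced by hxpipe and the result piped into hxunpipe.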
