ev0lution37

Reputation: 1179

Sed Regular Expression affecting content after the Regex

I have an HTML file containing the following text:

<!doctype html><html><head><meta charset="utf-8"><title>Test</title><base href="/"><meta name="viewport" content="width=device-width,initial-scale=1"></head><body>test</body></html>

And I run this sed command against it:

sed -i -e "s:<base href\s*=\s*\".*\"\s*>:<base href=\"/apps/test/\">:g" /tmp/test/index.html

I'd expect that to replace only <base href="/"> with <base href="/apps/test/"> and leave the rest alone, but it also affects content after the match:

 <!doctype html><html><head><meta charset="utf-8"><title>Test</title><base href="/apps/test/"></head><body>test</body></html>

It removed the entire meta tag that follows the base tag. Is my regex wrong?

GNU sed version 4.2.1

Upvotes: 0

Views: 109

Answers (2)

Benjamin W.

Reputation: 52536

Because * is greedy, the .* in =\s*\".*\"\s*> extends as far right as it can: it runs to the last " on the line that is followed by >, which here is the end of the viewport meta tag.

You can use single quotes around your command so you don't have to use \" for double quotes. Then, instead of ".*", you can use "[^"]*", which only matches to the next double quote.

This would make your command into

sed 's:<base href\s*=\s*"[^"]*"\s*>:<base href="/apps/test/">:g'
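To see the difference side by side, here is a small sketch that runs both patterns on the line from the question. It assumes GNU sed (for \s) and that the HTML sits on a single line, as posted; the variable names are only for illustration.

```shell
# Sample line from the question (assumed to be on one line, as posted).
html='<!doctype html><html><head><meta charset="utf-8"><title>Test</title><base href="/"><meta name="viewport" content="width=device-width,initial-scale=1"></head><body>test</body></html>'

# Greedy: ".*" runs to the LAST '"' that is followed by '>', which is the end
# of the viewport meta tag, so that tag is swallowed by the match.
greedy=$(printf '%s\n' "$html" | sed 's:<base href\s*=\s*".*"\s*>:<base href="/apps/test/">:g')

# Negated class: '[^"]*' cannot cross a double quote, so the match ends at
# the closing quote of the href value.
fixed=$(printf '%s\n' "$html" | sed 's:<base href\s*=\s*"[^"]*"\s*>:<base href="/apps/test/">:g')

printf 'greedy: %s\n' "$greedy"
printf 'fixed:  %s\n' "$fixed"
```

The greedy line should reproduce the output from the question (viewport meta tag gone), while the second keeps it.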

However, manipulating HTML with sed and regexes is brittle and tends to break on the first input variation (extra attributes, a line break inside a tag, different quoting). You could use an XML/HTML parser such as xmlstarlet (see Roman's answer); an alternative is the W3C HTML-XML-utils with their hxpipe and hxunpipe commands.

These commands parse your HTML and turn it into a format easily processed with sed, awk & friends, then turn it back into HTML:

$ hxpipe infile.html
!html "" 
(html
(head
Acharset CDATA utf-8
(meta
(title
-Test
)title
Ahref CDATA /
(base
Aname CDATA viewport
Acontent CDATA width=device-width,initial-scale=1
(meta
)head
(body
-test
)body
)html
-\n

so to turn the / in the href for the base tag into /apps/test/, we could do this:

$ hxpipe infile.html \
    | sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) [^\n]*|\1 /apps/test/|}' \
    | hxunpipe
<!DOCTYPE html><html><head><meta charset="utf-8"><title>Test</title><base href="/apps/test/"><meta name="viewport" content="width=device-width,initial-scale=1"></head><body>test</body></html>

where the sed command

sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) [^\n]*|\1 /apps/test/|}'

or, more readably,

/Ahref CDATA/ {                                    # If line matches this
    N                                              # Append next line
    /\n(base$/ s|\(CDATA\) [^\n]*|\1 /apps/test/|  # If in base tag, replace href
}

makes your change in a more or less robust fashion. The value is matched with [^\n]* rather than .* because, after N, the pattern space contains a newline, and in GNU sed . matches it, so .* would also swallow the (base line.
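A sketch of how the substitution behaves on the joined two-line pattern space, using a few hxpipe-style lines pasted in by hand from the listing above (so hxpipe itself is not required). It compares .* with [^\n]*; the latter relies on GNU sed treating \n inside a bracket expression as a newline, which is an assumption about your sed.

```shell
# hxpipe-style lines copied by hand from the listing above.
input='Ahref CDATA /
(base
Aname CDATA viewport
(meta'

# ".*" after N also consumes the embedded newline and "(base",
# so the base line disappears and href ends up on the next tag.
with_dot=$(printf '%s\n' "$input" \
    | sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) .*|\1 /apps/test/|}')

# "[^\n]*" (GNU sed) stops at the embedded newline and keeps "(base".
with_class=$(printf '%s\n' "$input" \
    | sed '/Ahref CDATA/{N;/\n(base$/s|\(CDATA\) [^\n]*|\1 /apps/test/|}')

printf '%s\n---\n%s\n' "$with_dot" "$with_class"
```

If the first variant prints three lines and the second four, the (base line was lost in the first case, and hxunpipe would then attach href to the following meta tag.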

Upvotes: 3

RomanPerekhrest

Reputation: 92904

The only right way to process XML/HTML data is with an XML/HTML parser.

xmlstarlet solution:

xmlstarlet fo -R -H /tmp/test/index.html | xmlstarlet ed -O -u '//base/@href' -v '/apps/test/'

The output:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8"/>
    <title>Test</title>
    <base href="/apps/test/"/>
    <meta name="viewport" content="width=device-width,initial-scale=1"/>
  </head>
  <body>test</body>
</html>

To modify the file in place, add the -L option: xmlstarlet ed -L -u ....

Upvotes: 2
