Reputation: 10191
I'm trying to find the following text:
<texit info> author=MySelf title=MyTitle </texit>
and replace it with nothing (an empty string).
What I've tried so far is the following:
sed -i '1,5s/<texit//;s/info>//;s/author=MySelf//;s/title=MyTitle//' test.txt
But it doesn't work.
Upvotes: 0
Views: 320
Reputation: 295989
Don't edit XML with sed -- the right tool would be something like XMLStarlet, with a line like the following:
xmlstarlet ed -u '//texit[@info]' -v 'author=NewAuthor title=NewTitle'
...if your goal were to update the text within the tag.
Regular expressions are not expressive enough to handle XML correctly. This holds formally as well: regular expressions can recognize only regular languages, and XML is not one. For instance, your original would be just as valid if it were split across lines, as in:
<texit
info >author=MySelf title=MyTitle</texit>
...and writing a sed command to handle that case would not be fun. XML-native tools, on the other hand, can correctly handle all of XML's corner cases.
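To illustrate why the multi-line form is awkward for sed, here is a minimal sketch. It assumes a two-line variant of the sample tag and a simple single-line pattern chosen only for illustration. Because sed applies each command to one input line at a time by default, a pattern that spans the line break never matches, and the text passes through unchanged.

```shell
#!/bin/sh
# Two-line variant of the tag from the question (illustrative input).
split=$(printf '<texit\ninfo>author=MySelf title=MyTitle</texit>\n')

# This pattern expects "<texit" and "info>" on the same line.
# sed sees "<texit" and "info>..." as two separate lines, so nothing matches.
out=$(printf '%s\n' "$split" | sed 's|<texit[[:space:]]*info>||')

# The output is identical to the input.
printf '%s\n' "$out"
```

Handling this in sed would mean pulling extra lines into the pattern space (for example with the N command), which quickly becomes fragile.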
That said, the sed expression you gave does indeed "work", inasmuch as it does exactly what it's written to do.
sed -e '1,5s/<texit//;s/info>//;s/author=MySelf//;s/title=MyTitle//' \
<<<"<texit info>author=MySelf title=MyTitle foo bar</texit>"
returns the output
foo bar</texit>
which is exactly what it should do: it removes the strings <texit, info>, author=MySelf, and title=MyTitle, but leaves the closing </texit> and any remaining text, just as you asked. If you expect or desire it to do something different, you should explain what that is.
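One detail of "exactly what it's written to do" that may be worth checking: in sed, an address such as 1,5 applies only to the command it directly precedes, so in the expression above only the first substitution is limited to lines 1 through 5. The following sketch uses a made-up six-line input and a shortened expression to show the difference between the ungrouped form and a braced group.

```shell
#!/bin/sh
# Six identical lines (hypothetical input) so line 6 falls outside 1,5.
input=$(for i in 1 2 3 4 5 6; do echo '<texit info>x'; done)

# Ungrouped: 1,5 restricts only s/<texit//; s/info>// runs on every line.
ungrouped=$(printf '%s\n' "$input" | sed -e '1,5s/<texit//;s/info>//')

# Grouped: braces place both substitutions under the 1,5 address.
grouped=$(printf '%s\n' "$input" | sed -e '1,5{s/<texit//;s/info>//;}')

printf '%s\n' "$ungrouped" | sed -n '6p'   # prints: <texit x
printf '%s\n' "$grouped" | sed -n '6p'     # prints: <texit info>x
```

If the intent was to touch only the first five lines, the braced form is probably what was meant.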
Upvotes: 2
Reputation: 20300
sed 's/<texit\s\+info>\s*author=MySelf\s\+title=MyTitle\s*<\/texit>//g' test.txt
You should generally not edit XML with a regex, but if you only want to strip these tags, the above will work. You don't need multiple s commands; a single pattern with correctly defined whitespace is enough.
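Note that \s and \+ are GNU sed extensions. A more portable sketch of the same single-pattern idea is shown below, using POSIX bracket classes and intervals. The surrounding words "before" and "after" are made up for the demonstration, and | is used as the s delimiter so the slash in </texit> needs no escaping.

```shell
#!/bin/sh
# Sample line from the question, with hypothetical text around it.
line='before <texit info> author=MySelf title=MyTitle </texit> after'

# [[:space:]] is the POSIX spelling of \s; \{1,\} is the POSIX spelling of \+.
pattern='<texit[[:space:]]\{1,\}info>[[:space:]]*author=MySelf[[:space:]]\{1,\}title=MyTitle[[:space:]]*</texit>'

# Delete every occurrence of the whole element on the line.
result=$(printf '%s\n' "$line" | sed "s|$pattern||g")

printf '%s\n' "$result"   # prints: before  after (two spaces remain)
```

Like the original, this only handles the element when it sits on a single line.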
Upvotes: 2