Reputation: 16066
I have an XML file that looks like this:
<Group>
<Name>Awesome Group</Name>
<Notes />
<Date>2013-04-04</Date>
<Expires>False</Expires>
<Icon>7</Icon>
<Tags />
</Group>
I'm trying to print everything between <Notes /> and </Icon> with this command:
$ sed -n '/\<Notes \/\>/ p' file.xml
Notice I'm escaping the open and close brackets as well as the forward slash before the close bracket. This returns no matches, which I find odd.
What's even more odd is that this command works:
$ sed -n '/<Notes \/>/ p' file.xml
Why does this command work, since I'm not escaping the open and close brackets?
EDIT
ruakh helpfully pointed out that there are different implementations of sed, and that open and close brackets don't need to be escaped (I thought sed used Perl syntax for regular expressions). I found another post on Unix & Linux that was also helpful: https://unix.stackexchange.com/questions/32907/what-characters-do-i-need-to-escape-when-using-sed-in-a-sh-script
Now I'm having a problem matching a multi-line regular expression. How come this doesn't work?
$ sed -n -r '/^<Notes \/>[\S\s]*?<\/Icon>$/ p' file.xml
I've tried with and without the -r (extended mode), with and without the ^ and $, and using .* instead of [\S\s]*, all with no matches.
Upvotes: 0
Views: 1273
Reputation: 204488
sed is an excellent tool for simple substitutions on a single line; for any other text manipulation you should use awk. Here's a GNU awk solution:
$ gawk -v RS='\0' '{print gensub(/.*(<Notes \/>.*<\/Icon>).*/,"\\1","")}' file
<Notes />
<Date>2013-04-04</Date>
<Expires>False</Expires>
<Icon>7</Icon>
Note that the above just prints between the symbols you asked for, not the whole lines that the symbols appeared on.
Upvotes: 1
Reputation: 183544
In sed, < and > have no special meaning, but \< and \> sometimes do: in some implementations, they mean "start of word" and "end of word". For example, this Bash command:
{ echo a ; echo ba ; echo b a ; } | sed -n '/\<a/ p'
will, on some systems, print a and b a (where there's an a at the very start of a word), but not ba (where there isn't).
(Judging from the tags you've chosen, you may be used to Perl? Perl makes a future-proof guarantee that \, when it precedes a non-word character, will always escape it. For example, < has no special meaning, but \< is guaranteed to mean < anyway. But not all regex engines take that approach.)
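To see why the escaped pattern from the question finds nothing, here is a small sketch. It assumes a sed (such as GNU sed) that treats \< and \> as word boundaries; on other implementations the second command may behave differently. The variable names are only for illustration.

```shell
# One line from the sample file in the question.
line='<Notes />'

# Unescaped angle brackets are literal characters in sed, so this matches.
plain=$(printf '%s\n' "$line" | sed -n '/<Notes \/>/ p')

# Assumption: \< and \> are word boundaries (GNU sed behaviour). The pattern
# then asks for the text "Notes /" followed by an end-of-word position.
# "/" is not a word character, so no such position exists and nothing prints.
escaped=$(printf '%s\n' "$line" | sed -n '/\<Notes \/\>/ p')

printf 'plain:   [%s]\n' "$plain"
printf 'escaped: [%s]\n' "$escaped"
```

On a sed with word-boundary support, the second pair of brackets should come out empty.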
Edit for edited question:
Sed processes one line at a time (that's part of what makes it a "stream editor"), so a multiline regex is essentially doomed to failure. However, in your case, you don't actually need a multiline regex; you just want to find the line that contains <Notes /> and the (distinct) line that contains </Icon>, and print all lines between the two (inclusive). For that, you can use an address range, specifying a start-address of /<Notes \/>/ and an end-address of /<\/Icon>/:
sed -n '/<Notes \/>/,/<\/Icon>/ p'
(See §3.2 "Selecting lines with sed" in the GNU sed user's manual.)
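As a quick check of the address range, the sketch below recreates the sample file from the question in a temporary file and runs the same command on it. The temporary file and the variable names are assumptions made for the demo, not part of the original setup.

```shell
# Recreate the sample XML from the question in a temporary file.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<Group>
<Name>Awesome Group</Name>
<Notes />
<Date>2013-04-04</Date>
<Expires>False</Expires>
<Icon>7</Icon>
<Tags />
</Group>
EOF

# Print from the first line matching <Notes /> through the next line
# matching </Icon>, both inclusive; -n suppresses all other output.
selected=$(sed -n '/<Notes \/>/,/<\/Icon>/ p' "$xml")
printf '%s\n' "$selected"

# Clean up the temporary file.
rm -f "$xml"
```

This should print the four lines from <Notes /> through <Icon>7</Icon>. Note that sed only looks for the end address on lines after the one that matched the start address, which is why the two patterns need to sit on distinct lines.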
Upvotes: 3