wwood

Reputation: 489

How can I use sed to get an xml value

How can I use sed to get the SOMETHING in <version.suffix>SOMETHING</version.suffix>?

I tried sed 's#.*>\(.*\)\<version\.suffix\>#\1#', but it fails.

Upvotes: 2

Views: 7175

Answers (4)

sjnarv

Reputation: 2374

Assuming the formatting of the question is accurate, when I run the example in the question as-is:

$ echo '<version.suffix>SOMETHING</version.suffix>' | sed 's#.*>\(.*\)\<version\.suffix\>#\1#'

I see the following output:

SOMETHING</>

In case my formatting skills fail me: this output ends with a left angle bracket, a forward slash, and finally a right angle bracket.

So, why this "failure"? Well, on my system (Linux with GNU grep 2.14), the grep(1) man page includes the following snippet:

The Backslash Character and Special Expressions

The symbols \< and \> respectively match the empty string at the beginning and end of a word.

Other answers suggest good alternatives to extract the value in XML tag syntax; use them.

I just wanted to point out why the RE in the original problem fails on current Linux systems: some escape sequences match no actual characters, but instead match empty boundaries. In GNU grep and GNU sed, \< and \> are such sequences (a GNU extension, not literal angle brackets). So, in this example, the brackets in the source are matched in unexpected ways:

  • the leading .*> has matched the opening tag <version.suffix>
  • the \(.*\) has matched SOMETHING</, to be printed by the \1 back-reference
  • the left-hand side of version.suffix is matched by \<
  • version.suffix is matched by version\.suffix
  • the right-hand side of version.suffix is matched by \>
  • the trailing > character remains in sed's pattern space and is printed.

TL;DR: "\X" does not mean "just match an X" for all X!
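
To make the difference concrete, here is a minimal sketch that runs the question's expression next to a variant with plain angle brackets. It assumes GNU sed (the word-boundary meaning of \< and \> is a GNU extension); the variable names are only for illustration.

```shell
# Assumes GNU sed: \< and \> are word-boundary anchors there, not literal brackets.
input='<version.suffix>SOMETHING</version.suffix>'

# Escaped brackets match empty word boundaries, so "</" ends up in the capture
# and the final ">" is never consumed by the pattern.
escaped=$(printf '%s\n' "$input" | sed 's#.*>\(.*\)\<version\.suffix\>#\1#')

# Plain < and > are ordinary characters in sed's basic regular expressions.
literal=$(printf '%s\n' "$input" | sed 's#.*>\(.*\)</version\.suffix>#\1#')

printf '%s\n' "$escaped"   # SOMETHING</>
printf '%s\n' "$literal"   # SOMETHING
```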

Upvotes: 1

clt60

Reputation: 63892

There are many possible ways, e.g.:

with sed

echo '<version.suffix>SOMETHING</version.suffix>' | sed 's#<[^>]*>##g'

or grep

echo '<version.suffix>SOMETHING</version.suffix>' | grep -oP '<version\.suffix>\K[^<]*(?=</version\.suffix>)'
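
One caveat with deleting every tag: it keeps the text of all elements on the line, not just the one you want. A small sketch of the difference, using a made-up input line with two elements (the <a> element and the variable names are assumptions for illustration):

```shell
# Hypothetical line with two elements, made up for illustration.
line='<a>x</a><version.suffix>SOMETHING</version.suffix>'

# Deleting every tag keeps the text of all elements on the line.
stripped=$(printf '%s\n' "$line" | sed 's#<[^>]*>##g')

# Naming the element in the pattern keeps only that element's text.
named=$(printf '%s\n' "$line" | sed 's#.*<version\.suffix>\([^<]*\)</version\.suffix>.*#\1#')

printf '%s\n' "$stripped"   # xSOMETHING
printf '%s\n' "$named"      # SOMETHING
```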

Upvotes: 1

Zilicon

Reputation: 3860

Try this one:

sed 's/<.*>\(.*\)<.*>/\1/'

It should be general enough to get the value of any simple element that sits alone on one line.

If you need to eliminate the indentation, add \s* at the beginning like this (\s is supported by GNU sed):

sed 's/\s*<.*>\(.*\)<.*>/\1/'

Alternatively if you only want version.suffix's value, you can make the command more specific like this:

sed 's/<version\.suffix>\(.*\)<.*>/\1/'
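
Note that sed prints every input line by default, whether or not the substitution matched. When the element sits inside a larger file, a common approach is to combine -n with the p flag so that only the matching line is printed. A minimal sketch, assuming a made-up multi-line input (the surrounding elements and the variable names are hypothetical):

```shell
# Hypothetical sample input; the surrounding elements are made up for illustration.
xml='<project>
  <version.prefix>1.2</version.prefix>
  <version.suffix>SOMETHING</version.suffix>
</project>'

# -n suppresses automatic printing; the trailing p prints only the lines where
# the substitution succeeded. The leading and trailing .* drop the indentation
# and anything else around the element.
suffix=$(printf '%s\n' "$xml" | sed -n 's#.*<version\.suffix>\(.*\)</version\.suffix>.*#\1#p')

printf '%s\n' "$suffix"   # SOMETHING
```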

Upvotes: 3

Avinash Raj

Reputation: 174696

You could use the sed command below:

$ echo '<version.suffix>SOMETHING</version.suffix>' | sed 's#^<[^>]*>\(.*\)<\/[^>]*>$#\1#'
SOMETHING
  • ^<[^>]*> matches the opening tag <version.suffix>.
  • \(.*\)<\/[^>]*>$ captures the characters up to the closing tag; the closing tag itself is matched by <\/[^>]*>.
  • Finally, all the matched characters are replaced by the contents of capture group 1.

Your regex is nearly correct; the only problem is that you wrote \< where the closing tag needs </ (and \> where it needs a plain >).

$ echo '<version.suffix>SOMETHING</version.suffix>' | sed 's#.*>\(.*\)</version\.suffix>#\1#'
                                                                       |<-Here
SOMETHING

Upvotes: 1
