jayceebee23

Reputation: 11

How to Extract a Specific XML Node using SED

First time posting here but not the first time using Stack Overflow as a resource. Must say this site has been integral to my work in general.

I have used sed in many ways before, but I can't figure out how to return a full XML node if, and only if, one of its child nodes meets certain criteria. I know how to use the two-address form (/START/,/END/command), but I need to restrict the result to nodes that contain a specific matching child.

Example:

<entity id="001">
    <name>Jane Doe</name>
    <country>US</country>
</entity>
<entity id="002">
    <name>Jose Reyes</name>
    <country>Mexico</country>
</entity>
<entity id="003">
    <name>Juan Dela Cruz</name>
    <country>Philippines</country>
</entity>
<entity id="004">
    <name>William Shatner</name>
    <country>US</country>
</entity>

If I want to return the full entity node with id 003, I can use the following command:

sed -n '/entity id="003"/,/<\/entity>/p'

However, if I want to return the full entity nodes that match the country US, how should I go about that one?

I don't mind doing the work myself if you can point me to a general direction. In fact, I do prefer that one over spoon feeding.

Thanks!

Upvotes: 0

Views: 619

Answers (1)

jas

Reputation: 10865

As you may have seen in comments on similar questions, the best thing for processing XML is a tool made for processing XML, and not a general text processing tool like sed or awk.

For example if you have access to xmlstarlet:

$ xmlstarlet sel -t -c "//entity[country = 'US']" file.xml
<entity id="001">
    <name>Jane Doe</name>
    <country>US</country>
</entity><entity id="004">
    <name>William Shatner</name>
    <country>US</country>
</entity>

Especially if you're going to be working with XML more than a little bit, I would put the effort into researching the available command line tools more suited for parsing XML.
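
Note that -c copies the matching nodes back to back, which is why the two results above run together as </entity><entity id="004">. A minimal sketch that puts each match on its own line, assuming the same file.xml and the -m (match), -c (copy-of) and -n (newline) template options of xmlstarlet sel; the function name is only for illustration:

```shell
# Hypothetical wrapper (the name is an assumption); pass the XML file as argument.
# -m iterates over each matching <entity>, -c . copies the current node,
# -n writes a newline after each copy.
us_entities_xml() {
    xmlstarlet sel -t -m "//entity[country = 'US']" -c . -n "$@"
}
# Usage: us_entities_xml file.xml
```

The XPath is unchanged; only the template differs, so the filter can still be swapped for any other predicate.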

If you're really stuck then awk would be a better option than sed, and awk should be available anywhere sed is:

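$ cat a.awk
# Opening tag: start buffering a new entity
/<entity id/ { f = 1; s = "" }
# While inside an entity, append the current line to the buffer
f { s = s ? (s ORS $0) : $0 }
# Remember that this entity has the country we want
/<country>US</ { f = 2 }
# Closing tag: print the buffer only if the country matched
/<\/entity>/ {
    if (f == 2) print s
    f = 0
}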

$ awk -f a.awk file.xml
  <entity id="001">
    <name>Jane Doe</name>
    <country>US</country>
  </entity>
  <entity id="004">
    <name>William Shatner</name>
    <country>US</country>
  </entity>
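
If it really has to be sed, the same buffer-then-decide idea can be done with the hold space. This is a rough sketch, assuming one tag per line as in the sample data; the function name us_entities_sed is made up for the example, and it is just as fragile as any other line-based approach to XML:

```shell
# Hypothetical helper (the name is an assumption): print every <entity> block
# that contains <country>US</country>. Reads the named files, or stdin if none.
us_entities_sed() {
    sed -n '/<entity /,/<\/entity>/{
        # Every line except the opening tag is appended to the hold space
        /<entity /!H
        # The opening tag overwrites the hold space, starting a fresh block
        /<entity /h
        /<\/entity>/{
            # Block complete: fetch it and print it only if the country matches
            g
            /<country>US<\/country>/p
        }
    }' "$@"
}
# Usage: us_entities_sed file.xml
```

The range address is the same two-address form used in the question; the extra work is collecting the block first, because sed cannot decide whether to print the opening tag until it has seen the country line.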

Upvotes: 1
