How to Extract a Specific XML Node using SED

Question

First time posting here but not the first time using Stack Overflow as a resource. Must say this site has been integral to my work in general.

I have used sed in many ways before, but I can't figure out how to return a full XML node if, and only if, one of its child nodes meets certain criteria. I know how to use the two-address convention (/START/,/END/command), but I need to restrict the result to nodes that have a specific matching child.

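Example:

<entity id="001">
    <name>Jane Doe</name>
    <country>US</country>
</entity>
<entity id="002">
    <name>Jose Reyes</name>
    <country>Mexico</country>
</entity>
<entity id="003">
    <name>Juan Dela Cruz</name>
    <country>Philippines</country>
</entity>
<entity id="004">
    <name>William Shatner</name>
    <country>US</country>
</entity>
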
If I want to return the full entity node with id 003, I can use the following command:

sed -n '/entity id="003"/,/<\/entity>/p'

However, if I want to return the full entity nodes that match the country US, how should I go about that one?

I don't mind doing the work myself if you can point me to a general direction. In fact, I do prefer that one over spoon feeding.

Thanks!

jas · Accepted Answer

As you may have seen in comments on similar questions, the best thing for processing XML is a tool made for processing XML, and not a general text processing tool like sed or awk.

For example if you have access to xmlstarlet:

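$ xmlstarlet sel -t -c "//entity[country = 'US']" file.xml
<entity id="001">
    <name>Jane Doe</name>
    <country>US</country>
  </entity><entity id="004">
    <name>William Shatner</name>
    <country>US</country>
  </entity>
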
Especially if you're going to be working with XML more than a little bit, I would put the effort into researching the available command line tools more suited for parsing XML.

If you're really stuck then awk would be a better option than sed, and awk should be available anywhere sed is:

$ cat a.awk

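/<entity[ >]/ { s = $0; f = 1; next }    # start of an entity: begin buffering
f             { s = s ORS $0 }           # inside an entity: append the line
f && /<country>US<\/country>/ { f = 2 }  # matching child found: flag the entity
/<\/entity>/ {
    if (f == 2) print s
    f = 0
}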

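$ awk -f a.awk file.xml
  <entity id="001">
    <name>Jane Doe</name>
    <country>US</country>
  </entity>
  <entity id="004">
    <name>William Shatner</name>
    <country>US</country>
  </entity>
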
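If sed really is a hard requirement, the same buffering idea can be expressed with sed's hold space: collect each entity block, then print it only once the closing tag is reached and the block contains the wanted child. What follows is only a minimal sketch. It assumes every tag sits on its own line, and the sample markup (the `<name>` tag and the ids other than 003) is made up for illustration, so adjust the patterns to your real file.

```shell
# Sample input; the <name> tag and the ids other than 003 are assumptions.
file=$(mktemp)
cat > "$file" <<'EOF'
<entity id="001">
    <name>Jane Doe</name>
    <country>US</country>
</entity>
<entity id="002">
    <name>Jose Reyes</name>
    <country>Mexico</country>
</entity>
<entity id="003">
    <name>Juan Dela Cruz</name>
    <country>Philippines</country>
</entity>
<entity id="004">
    <name>William Shatner</name>
    <country>US</country>
</entity>
EOF

result=$(sed -n '
/<entity[ >]/,/<\/entity>/{
    # First line of a block: start a fresh buffer in the hold space.
    /<entity[ >]/{
        h
        d
    }
    # Every other line of the block is appended to the buffer.
    H
    # Closing tag: swap the buffer in and print it only if the child matches.
    /<\/entity>/{
        x
        /<country>US<\/country>/p
    }
}' "$file")

printf '%s\n' "$result"
rm -f "$file"
```

This is line-oriented pattern matching, not XML parsing, so it breaks as soon as an entity is written on a single line, tags carry extra whitespace, or entities are nested. That fragility is the reason a real XML tool is the better default.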