user3258505
user3258505

Reputation: 11

Regex to extract plain-text XML nodes

I have a LARGE XML file. I'm troubleshooting some things, and I would like to extract specific nodes from the XML file. I don't want a SimpleXML object, I want to make a new file with the raw string matching what I want (posting this on bash/sed/php).

<?xml version="1.0" encoding="UTF-8"?>
<definition></definition>
    <metadata></metadata>
    <nodeToRegex>
        <nodeImightwant>
            <subnode>
                <subsubnode1></subsubnode1>
                <subsubnodeToCheck>stringCheck</subnodeToCheck>
                <subsubnode2></subsubnode2>
            </subnode>
        </nodeImightwant>
        <nodeImightwant></nodeImightwant>
        <nodeImightwant></nodeImightwant>
    </nodeToRegex>

So from this XML file, I want all lines from every node except the nodeToRegex. From nodeToRegex, I only want the nodeImightwant if the stringCheck string equals "aValidString". Can this be done via regex or should I just copy and paste the stuff out of the file? (my regex skills are subpar)

Upvotes: 0

Views: 977

Answers (1)

elixenide
elixenide

Reputation: 44851

Don't parse XML with regexes. There is no reason you can't repackage/rearrange the data using SimpleXML, but trying to do it with a regex is a recipe for lots of headaches and, ultimately, broken code.

See this classic example for why parsing XML/HTML/XHTML with regexes is the road to madness.

If you insist on using a regex, just replace the nodes you don't want, like this:

$myxml = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~', '', $myxml);

Regular expression visualization

Debuggex Demo

Upvotes: 1

Related Questions