Reputation: 7254
I need to parse XML files with regard to only one namespace.
By "with regard to only one namespace" I mean that if I have a document like this:
<xc:document xmlns:xc="asdasd">
<asdf>
<xc:abcd />
</asdf>
</xc:document>
I would like <asdf> and </asdf> to be treated as text.
The structure of this document should look like this:
document
|
|- text (<asdf>)
|- abcd
|- text (</asdf>)
What is the simplest method to achieve this?
Upvotes: 2
Views: 116
Reputation: 619
Pretty much any XML parser is going to lose distinctions such as whether single or double quotes were used, whether CDATA sections were used, and how much whitespace appeared inside tags (as opposed to between tags).
So: <boy socks="black" ></boy> might come back as <boy socks='black'/>
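A minimal sketch of that loss of detail, using Python's standard-library `xml.etree.ElementTree` purely for illustration (the answer does not name a parser). The quote direction is reversed from the example above because this particular serializer emits double quotes; the point is only that the output need not match the input byte for byte.

```python
import xml.etree.ElementTree as ET

# Input uses single quotes, whitespace inside the tag, and an explicit end tag.
source = "<boy socks='black' ></boy>"

# Parse, then serialize again. The parser keeps the element and its attribute,
# but not the lexical details of how they were written.
round_tripped = ET.tostring(ET.fromstring(source), encoding="unicode")

print(source)
print(round_tripped)
```

The two printed lines describe the same element, yet they differ as strings, which is why a parser alone cannot hand the original markup back as text.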
If you want to treat the input as not XML, you'll have to fall back on non-XML tools, or rethink your situation entirely, as this is a very unusual thing to want to do.
It's fairly easy in a text-processing language such as Perl, if you are careful. For example,
perl -p -e 's#<(/?[^:<>\s]+[\s>])#&lt;$1#g'
will go a long way, by changing the < signs you want to treat as text into &lt; instead. This approach actually works best if you read the whole file into Perl rather than (as in this example) processing it a line at a time, so that you can match close tags spread over multiple lines,
</boy
> like this.
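The whole-file variant could look roughly like this in Python. It is a sketch under assumptions, not a robust solution: the pattern and the helper name `escape_unprefixed_tags` are made up for this example, every tag without a prefix is assumed to be one you want as text (so a default `xmlns="..."` declaration would break it), and CDATA sections are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Matches the "<" of a start or end tag whose name has no prefix (no colon).
# Comments (<!-- -->) and processing instructions (<? ?>) are left alone.
# \s also matches a newline, so an end tag split as "</boy\n>" is covered.
UNPREFIXED_TAG = re.compile(r"<(?=/?[^\s<>:/!?]+[\s/>])")

def escape_unprefixed_tags(xml_text: str) -> str:
    """Turn the '<' of every unprefixed tag into '&lt;' (hypothetical helper)."""
    return UNPREFIXED_TAG.sub("&lt;", xml_text)

# The document from the question, processed as one string, not line by line.
document = """<xc:document xmlns:xc="asdasd">
<asdf>
<xc:abcd />
</asdf>
</xc:document>"""

escaped = escape_unprefixed_tags(document)
root = ET.fromstring(escaped)  # now parses with <asdf> and </asdf> as text

print(repr(root.text))      # text before the xc:abcd element
print(root[0].tag)          # the only child element
print(repr(root[0].tail))   # text after the xc:abcd element
```

After escaping, an ordinary parser sees one `xc:document` element whose content is text, then `xc:abcd`, then text, which is the structure the question asks for.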
Still, it is best to parse XML with an XML parser rather than with regular expressions, so if the kinds of changes I mentioned above are acceptable, this is really easy to do in XSLT.
Upvotes: 0
Reputation: 8885
Transform the document with XSLT first, so that the nodes you want treated as text actually are text.
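For illustration only, here is the same "transform first" idea sketched with Python's standard-library `xml.dom.minidom` instead of XSLT, so this is a swapped-in technique and not the stylesheet itself. The names `KEEP_NS`, `start_tag` and `flatten` are invented for the example, the root element is assumed to be in the kept namespace, and the start tags are rebuilt from the parsed tree, so the original quoting and spacing are not preserved.

```python
from xml.dom import minidom

KEEP_NS = "asdasd"  # namespace URI taken from the question's example

def start_tag(el):
    """Rebuild a start tag as a string (original quoting is not recoverable)."""
    attrs = "".join(f' {a.name}="{a.value}"' for a in el.attributes.values())
    return f"<{el.tagName}{attrs}>"

def flatten(node):
    """Replace child elements outside KEEP_NS with text nodes for their tags."""
    for child in list(node.childNodes):
        if child.nodeType != child.ELEMENT_NODE:
            continue
        flatten(child)  # handle descendants before touching this element
        if child.namespaceURI == KEEP_NS:
            continue
        doc = child.ownerDocument
        node.insertBefore(doc.createTextNode(start_tag(child)), child)
        for grandchild in list(child.childNodes):
            node.insertBefore(grandchild, child)  # moves it up one level
        node.insertBefore(doc.createTextNode(f"</{child.tagName}>"), child)
        node.removeChild(child)

dom = minidom.parseString(
    '<xc:document xmlns:xc="asdasd">\n<asdf>\n<xc:abcd />\n</asdf>\n</xc:document>'
)
root = dom.documentElement
flatten(root)
root.normalize()  # merge adjacent text nodes

for child in root.childNodes:
    print(child.nodeName)
```

Serializing the result with `dom.toxml()` escapes the `<` characters in the new text nodes, so any downstream XML tool sees them as plain text.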
Upvotes: 3