Grant H.

Reputation: 3717

Replace XML String using Regex or HtmlAgilityPack

First things first: I'm well aware that using Regex to parse XML is a bad idea. That said, this XML is malformed enough that using XML parsers will substantially change the output at best, and at worst render it invalid for the engine that consumes it. It is a proprietary specification defined by a third party, and I have no control over it.

Given that the typical gotchas with Regex/XML won't be a problem here because of the limited scope, how would one define a regex to capture the following:

<ns:elementname attr="value">
  arbitrary data/child nodes here
</ns:elementname>

I've tried:

var tOut5 = Regex.Replace(entry, 
@"<ns:elementname(.*?)ns:elementname>", 
"", RegexOptions.Multiline);

As well as a few other variants.

With HtmlAgilityPack I've tried:

var doc = new HtmlDocument();
doc.OptionWriteEmptyNodes = true;
doc.LoadHtml(text);
var Elements = doc.DocumentNode.Descendants()
.Where(n => n.Name == "ns:elementname");

Which works for selecting the node, but when saving the output, it affects the way other nodes are rendered as a byproduct.

I'm also open to other suggestions, but please keep in mind that the only part of the overall document that can be altered is this node, and that the XML is too malformed to use with most parsers.

Upvotes: 0

Views: 154

Answers (1)

egandalf

Reputation: 898

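In a regex tester, this worked for me. Note the use of RegexOptions.Singleline, which makes the dot (.) match every character, including newlines. RegexOptions.Multiline only changes how ^ and $ behave, so it does not let the pattern span lines.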

<ns:elementname(.+?)>.+?</ns:elementname>
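
For reference, here is a minimal sketch of how that pattern might be plugged into the Regex.Replace call from the question. The sample input string is made up for illustration, and the sketch assumes the element always carries at least one attribute and never nests inside another element of the same name; if either assumption fails, the lazy match could end in the wrong place.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Hypothetical sample input; the real document is assumed to look similar.
        string entry =
            "<root>\n" +
            "<ns:elementname attr=\"value\">\n" +
            "  arbitrary data/child nodes here\n" +
            "</ns:elementname>\n" +
            "<other>untouched</other>\n" +
            "</root>";

        // Singleline lets '.' match newlines, so the lazy .+? can span
        // several lines and stop at the first closing tag it finds.
        string result = Regex.Replace(
            entry,
            @"<ns:elementname(.+?)>.+?</ns:elementname>",
            "",
            RegexOptions.Singleline);

        // The matched element is removed; everything else is left as-is.
        Console.WriteLine(result);
    }
}
```

Because the replacement works on the raw string, the rest of the document is never re-serialized, which should avoid the side effects seen when saving through HtmlAgilityPack.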


Upvotes: 1
