orom

Reputation: 861

Preserve self-closing tags on extraction

Consider the following example:

<case>
   <outer>
    <inner>test</inner>
    <inner>test &amp; test <br /><br />test</inner>
    <inner></inner>
   </outer>
</case>

I would like to extract the string enclosed within the second inner element while preserving br tags (or preferably getting them as \n), but decoding all the HTML encoded characters. That is, I would like to get:

"test & test \n\ntest"

or

"test & test <br /><br />test"

So far I have tried the following. It decodes the HTML-encoded characters but removes the <br /> tags completely.

    XDocument xDoc = XDocument.Load(file);
    XNamespace ns = XNamespace.Get("http://www.w3.org/1999/xhtml");
    var cas = xDoc.Descendants().First(e => e.Name.Equals(ns.GetName("case")));
    foreach (var row in cas.Elements())
    {
        var columnVals = row.Elements(ns.GetName("inner")).Select(e => e.Value);
        string str = columnVals.Skip(1).First();
        // str == "test & test test"
        // but i want:
        // "test & test \n\ntest" or "test & test <br /><br />test"
    }

Upvotes: 0

Views: 187

Answers (1)

Baldrick

Reputation: 11840

Try the following:

    // WebUtility lives in System.Net
    XDocument xDoc = XDocument.Load(file);
    XNamespace ns = XNamespace.Get("http://www.w3.org/1999/xhtml");
    var cas = xDoc.Descendants().First(e => e.Name.Equals(ns.GetName("case")));
    foreach (var row in cas.Elements())
    {
        // Take the child nodes (text and <br /> elements) instead of e.Value
        var columnVals = row.Elements(ns.GetName("inner")).Select(e => e.Nodes());
        var str = columnVals.Skip(1).First();
        var stringResult = WebUtility.HtmlDecode(string.Join(" ", str));
    }

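This takes the child nodes of the element instead of its Value, joins their string representations (which keeps the <br /> markup), and then decodes the HTML escaping.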

The output is:

test & test  <br /> <br /> test
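
If the \n form is preferred, one option is to walk the child nodes and map each br element to a newline. The following is only a minimal sketch and has not been run against the original file: BrExtraction and TextWithLineBreaks are hypothetical names, and the inline XML stands in for the file (without the XHTML namespace). It matches on LocalName, on the assumption that the check should work with or without a namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class BrExtraction
{
    // Hypothetical helper: concatenates an element's child nodes as text,
    // turning each <br /> element into "\n".
    public static string TextWithLineBreaks(XElement element)
    {
        return string.Concat(element.Nodes().Select(node =>
        {
            if (node is XText text)
                return text.Value;      // entities were already decoded by the parser
            if (node is XElement child && child.Name.LocalName == "br")
                return "\n";
            return node.ToString();     // keep any other markup verbatim
        }));
    }

    static void Main()
    {
        // Stand-in for the <outer> element from the question
        var outer = XElement.Parse(
            "<outer><inner>test</inner>" +
            "<inner>test &amp; test <br /><br />test</inner>" +
            "<inner></inner></outer>");

        var second = outer.Elements("inner").Skip(1).First();
        string result = TextWithLineBreaks(second);

        Console.WriteLine(result == "test & test \n\ntest"); // prints "True"
    }
}
```

Because XText.Value is already decoded, no call to WebUtility.HtmlDecode is needed in this variant, and no extra spaces are inserted around the line breaks.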

Upvotes: 1
