Hello

Reputation: 13

Remove CDATA from the input

I get a string which has CDATA and I want to remove that.

Input : "<Text><![CDATA[Hello]]></Text><Text><![CDATA[World]]></Text>"
Output I want : <text>Hello</text> 
              <text>World</text>

I want to take all data between <text> and </text> and add it to a list.

The code I tried is:

private List<XElement> Foo(string input)
{
    string pattern = "<text>(.*?)</text>";
    input = "<Text><![CDATA[Hello]]></Text><Text><![CDATA[World]]></Text>"; // For testing
    var matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);
    var a = matches.Cast<Match>().Select(m => m.Groups[1].Value.Trim()).ToArray();

    List<XElement> li = new List<XElement>();
    XElement xText;
    for (int i = 0; i < a.Length; i++)
    {
        xText = new XElement("text");
        xText.Add(System.Net.WebUtility.HtmlDecode(a[i]));
        li.Add(xText);
    }
    return li;
} 

But here I get this output:

<text>&lt;![CDATA[Hello]]&gt;</text>
<text>&lt;![CDATA[World]]&gt;</text>

Can anyone please help me?

Upvotes: 1

Views: 3338

Answers (2)

Jon Skeet

Reputation: 1503489

It seems to me that you shouldn't be using a regular expression at all. Instead, construct a valid XML document by wrapping it all in a root element, then parse it and extract the elements you want.

You also want to replace all CDATA nodes with their equivalent text nodes. You can do that before or after you extract the elements into a list, but I've chosen to do it before:

using System;
using System.Linq;
using System.Xml.Linq;

class Test
{
    static void Main()
    {
        string input = "<Text><![CDATA[Hello]]></Text><Text><![CDATA[World]]></Text>";
        string xml = "<root>" + input + "</root>";
        var doc = XDocument.Parse(xml);
        var nodes = doc.DescendantNodes().OfType<XCData>().ToList();
        foreach (var node in nodes)
        {
            node.ReplaceWith(new XText(node.Value));
        }
        var elements = doc.Root.Elements().ToList();
        elements.ForEach(Console.WriteLine);
    }
}
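
If you need the result as a List<XElement> with lowercase <text> names, as in the question, the same idea can be sketched as a method. This is only a minimal sketch under some assumptions: the class name CDataHelper is hypothetical, the method name Foo is borrowed from the question, and every top-level element in the input is assumed to be one you want to keep. It reads each element's Value (which already contains the CDATA content as plain text) instead of replacing the CDATA nodes in place.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper class; only the method name Foo comes from the question.
static class CDataHelper
{
    public static List<XElement> Foo(string input)
    {
        // Wrap the fragment in a single root so that it parses as one XML document.
        var doc = XDocument.Parse("<root>" + input + "</root>");

        // XElement.Value returns the concatenated text content, CDATA included,
        // so building a new element from it drops the CDATA wrapper.
        // Assumption: every child of the root should become a <text> element.
        return doc.Root.Elements()
            .Select(e => new XElement("text", e.Value))
            .ToList();
    }
}
```

Calling CDataHelper.Foo with the input from the question gives two elements whose ToString() values are <text>Hello</text> and <text>World</text>.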

Upvotes: 5

C1rdec

Reputation: 1687

I would use XDocument instead of Regex:

var value = "<root><Text><![CDATA[Hello]]></Text><Text><![CDATA[World]]></Text></root>";
var doc = XDocument.Parse(value);
Console.WriteLine (doc.Root.Elements().ElementAt(0).Value);
Console.WriteLine (doc.Root.Elements().ElementAt(1).Value);

Output:

Hello
World

Upvotes: 0
