Ninita

Reputation: 1249

Using C# to remove custom XML tags from HTML

I have a string containing some HTML, and I need to parse it into an XDocument.

string input = String.Concat("<root>", htmlString, "</root>");
var doc = XDocument.Parse(input);

But sometimes htmlString contains tags like <o:p></o:p>, and then XDocument.Parse() throws this exception:

The ':' character, hexadecimal value 0x3A, cannot be included in a name. Line 1, position 650.

How can I remove those tags, or at least replace the ':' in the tag name?

Before parsing, I try to remove/replace the ':' but it isn't working:

try
{
    Regex regex = new Regex(@"<[:][^>]+>.+?</\[:]>");
    while (regex.IsMatch(htmlString))
    {
        htmlString= regex.Replace(htmlString, "");
    }
}
catch { }

HTML example

<p>Some text</p>

<p class="MsoNormal" style="TEXT-ALIGN: justify; MARGIN: 0cm 0cm 0pt; LINE-HEIGHT: 150%">
    <?xml:namespace prefix="o" ns="urn:schemas-microsoft-com:office:office"?>
    <o:p> </o:p>
</p>

<p>More text</p>

UPDATE

I'm using HtmlAgilityPack, but it doesn't remove these tags.

My code

ConfigureHtmlDocument();

var htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(htmlString);

var htmlError = htmlDoc.ParseErrors.SafeAny();

if (!htmlError)
    htmlString= htmlDoc.DocumentNode.InnerHtml;

try
{
    Regex regex = new Regex(@"<[:][^>]+>.+?</\[:]>");
    while (regex.IsMatch(htmlString))
    {
        htmlString= regex.Replace(htmlString, "");
    }
}
catch { }

string input = String.Concat("<root>", htmlString, "</root>");
var doc = XDocument.Parse(input);

//more code

ConfigureHtmlDocument()

    if (!HtmlNode.ElementsFlags.ContainsKey("p"))
        HtmlNode.ElementsFlags.Add("p", HtmlElementFlag.Closed);
    else
        HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;

    if (!HtmlNode.ElementsFlags.ContainsKey("ul"))
        HtmlNode.ElementsFlags.Add("ul", HtmlElementFlag.Closed);
    else
        HtmlNode.ElementsFlags["ul"] = HtmlElementFlag.Closed;

    if (!HtmlNode.ElementsFlags.ContainsKey("li"))
        HtmlNode.ElementsFlags.Add("li", HtmlElementFlag.Closed);
    else
        HtmlNode.ElementsFlags["li"] = HtmlElementFlag.Closed;

    if (!HtmlNode.ElementsFlags.ContainsKey("ol"))
        HtmlNode.ElementsFlags.Add("ol", HtmlElementFlag.Closed);
    else
        HtmlNode.ElementsFlags["ol"] = HtmlElementFlag.Closed;

    //more similar code

Upvotes: 3

Views: 2644

Answers (2)

Ninita

Reputation: 1249

Solved! My regular expression was wrong. I replaced it with these:

// Remove XML processing instructions such as <?xml:namespace ... ?>
htmlString = Regex.Replace(htmlString, @"<\?xml.*?\?>", "");

// Remove custom opening tags like <o:p> (matches single-character prefixes only)
htmlString = Regex.Replace(htmlString, @"<(?:[\S]\:[\S])[^>]*>", "");
// Remove the corresponding closing tags like </o:p>
htmlString = Regex.Replace(htmlString, @"</(?:[\S]\:[\S])[^>]*>", "");

And now it works!
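
The patterns above only match a one-character prefix before the colon. For input that may also contain longer prefixes (for example <st1:place>) or processing instructions that span several lines, a slightly more general variant might look like the sketch below. The names HtmlCleaner and StripPrefixedTags are hypothetical, and the pattern assumes that no attribute value contains a '>' character.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class HtmlCleaner
{
    // Processing instructions such as <?xml:namespace ... ?>.
    // Singleline lets '.' match line breaks; '.*?' stops at the first "?>".
    private static readonly Regex XmlInstruction =
        new Regex(@"<\?xml.*?\?>", RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Opening, closing and self-closing tags whose name contains a prefix,
    // e.g. <o:p>, </o:p>, <st1:place />. The name must start right after
    // '<' or '</', so a ':' inside an attribute value is not affected.
    private static readonly Regex PrefixedTag =
        new Regex(@"</?[A-Za-z_][\w.-]*:[\w.-]+[^>]*>");

    // Hypothetical helper: removes the tags themselves but keeps any text
    // that was between them.
    public static string StripPrefixedTags(string html)
    {
        string withoutInstructions = XmlInstruction.Replace(html, "");
        return PrefixedTag.Replace(withoutInstructions, "");
    }
}
```

It would be called just before parsing, e.g. XDocument.Parse("<root>" + HtmlCleaner.StripPrefixedTags(htmlString) + "</root>"). A single pass is enough, so no while loop around IsMatch is needed.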

Upvotes: 1

CooncilWorker

Reputation: 415

If you know the namespace prefix in advance, then you can do something simple like this:

htmlString = htmlString.Replace("<o:", "<").Replace("</o:", "</");
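
If several prefixes are known in advance, the same idea can be wrapped in a small helper. This is only a sketch: PrefixStripper and DropPrefixes are hypothetical names, and the prefixes passed in are whatever your input is known to contain. Note that this approach renames the tags rather than removing them, so <o:p> becomes an ordinary <p>.

```csharp
using System;

public static class PrefixStripper
{
    // Hypothetical helper: strips each known prefix from opening and
    // closing tags, turning <o:p> into <p> and </o:p> into </p>.
    public static string DropPrefixes(string html, params string[] prefixes)
    {
        foreach (string prefix in prefixes)
        {
            html = html
                .Replace("<" + prefix + ":", "<")
                .Replace("</" + prefix + ":", "</");
        }
        return html;
    }
}
```

For example, PrefixStripper.DropPrefixes(htmlString, "o", "st1") handles both prefixes in one call. A processing instruction such as <?xml:namespace ... ?> is not touched by this and would still need to be removed separately.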

Upvotes: 0
