Reputation: 1249
I have a string containing some HTML, and I need to parse that HTML into an XDocument.
string input = String.Concat("<root>", htmlString, "</root>");
var doc = XDocument.Parse(input);
But sometimes htmlString contains tags like <o:p></o:p>, and then XDocument.Parse() throws this exception:
The ':' character, hexadecimal value 0x3A, cannot be included in a name. Line 1, position 650.
How can I remove those tags, or at least replace the ':' in the tag name?
Before parsing, I try to remove them with a regular expression, but it isn't working:
try
{
    Regex regex = new Regex(@"<[:][^>]+>.+?</\[:]>");
    while (regex.IsMatch(htmlString))
    {
        htmlString = regex.Replace(htmlString, "");
    }
}
catch { }
HTML example
<p>Some text</p>
<p class="MsoNormal" style="TEXT-ALIGN: justify; MARGIN: 0cm 0cm 0pt; LINE-HEIGHT: 150%">
<?xml:namespace prefix="o" ns="urn:schemas-microsoft-com:office:office"?>
<o:p> </o:p>
</p>
<p>More text</p>
UPDATE
I'm using HtmlAgilityPack, but it doesn't remove these tags.
My code
ConfigureHtmlDocument();
var htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(htmlString);
var htmlError = htmlDoc.ParseErrors.SafeAny();
if (!htmlError)
    htmlString = htmlDoc.DocumentNode.InnerHtml;
try
{
    Regex regex = new Regex(@"<[:][^>]+>.+?</\[:]>");
    while (regex.IsMatch(htmlString))
    {
        htmlString = regex.Replace(htmlString, "");
    }
}
catch { }
string input = String.Concat("<root>", htmlString, "</root>");
var doc = XDocument.Parse(input);
//more code
ConfigureHtmlDocument()
if (!HtmlNode.ElementsFlags.ContainsKey("p"))
    HtmlNode.ElementsFlags.Add("p", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
if (!HtmlNode.ElementsFlags.ContainsKey("ul"))
    HtmlNode.ElementsFlags.Add("ul", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["ul"] = HtmlElementFlag.Closed;
if (!HtmlNode.ElementsFlags.ContainsKey("li"))
    HtmlNode.ElementsFlags.Add("li", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["li"] = HtmlElementFlag.Closed;
if (!HtmlNode.ElementsFlags.ContainsKey("ol"))
    HtmlNode.ElementsFlags.Add("ol", HtmlElementFlag.Closed);
else
    HtmlNode.ElementsFlags["ol"] = HtmlElementFlag.Closed;
//more similar code
Upvotes: 3
Views: 2644
Reputation: 1249
Solved! My regular expression was wrong. I replaced it with this:
// remove XML declarations and processing instructions such as <?xml:namespace ... ?>
htmlString = Regex.Replace(htmlString, @"<\?xml.*\?>", "");
// remove the opening and closing tags of custom elements such as <o:p> and </o:p>
htmlString = Regex.Replace(htmlString, @"<(?:[\S]\:[\S])[^>]*>", "");
htmlString = Regex.Replace(htmlString, @"</(?:[\S]\:[\S])[^>]*>", "");
And now it works!
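A couple of caveats on the patterns above, with a sketch that tries to address them. `[\S]\:[\S]` only allows a single character before the colon, so a tag with a longer prefix such as `<st1:place>` would be left in place, and the greedy `.*` in the first pattern can swallow everything up to the last `?>` on a line. The class and method names below (`HtmlCleaner`, `StripPrefixedMarkup`) are made up for illustration, and this is only a regex-based approximation, not a real HTML parser:

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class HtmlCleaner
{
    // Processing instructions such as <?xml:namespace ... ?>.
    // The lazy ".*?" stops at the first "?>"; Singleline lets it span line breaks.
    private static readonly Regex ProcessingInstruction =
        new Regex(@"<\?.*?\?>", RegexOptions.Singleline);

    // Opening, closing and self-closing tags whose name contains a prefix,
    // e.g. <o:p>, </o:p>, <st1:place w:st="on">. The colon must directly follow
    // the prefix, so a colon inside an attribute value does not trigger a match.
    private static readonly Regex PrefixedTag =
        new Regex(@"</?[A-Za-z][\w.-]*:[\w.-]+[^>]*>");

    // Removes the tags themselves but keeps any text between them.
    public static string StripPrefixedMarkup(string html)
    {
        string withoutInstructions = ProcessingInstruction.Replace(html, "");
        return PrefixedTag.Replace(withoutInstructions, "");
    }
}
```

With that in place, the parsing step from the question would become something like `XDocument.Parse("<root>" + HtmlCleaner.StripPrefixedMarkup(htmlString) + "</root>")`. It can still fail on HTML that is not well-formed XML for other reasons (unclosed tags, `&nbsp;`, and so on).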
Upvotes: 1
Reputation: 415
If you know the namespace prefix in advance, you can do something as simple as this:
htmlString = htmlString.Replace("<o:", "<").Replace("</o:", "</");
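If there is more than one known prefix, the same idea could be wrapped in a small helper. This is just a sketch; `PrefixStripper` and `StripKnownPrefixes` are names I made up. Note that it renames the elements (`<o:p>` becomes `<p>`) rather than removing them, and it does nothing about the `<?xml:namespace ... ?>` line from the question, which would still have to be removed separately:

```csharp
using System;

public static class PrefixStripper
{
    // For each known prefix, turns "<prefix:name" into "<name" and
    // "</prefix:name" into "</name". Tags with other prefixes are left alone.
    public static string StripKnownPrefixes(string html, params string[] prefixes)
    {
        foreach (string prefix in prefixes)
        {
            html = html.Replace("<" + prefix + ":", "<")
                       .Replace("</" + prefix + ":", "</");
        }
        return html;
    }
}
```

Usage would look like `htmlString = PrefixStripper.StripKnownPrefixes(htmlString, "o", "st1");`.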
Upvotes: 0