Reputation: 309
I'm extracting some data from a webpage, and I want to remove everything between the characters < and >, along with those characters themselves. Here is an example of a string I get from the site:
<a>SomeTextHere</a>MoreText<br><tr>SomeText</tr>
I want my final result to be:
SomeTextHere MoreText SomeText
Is there a way I can do this quickly and efficiently?
Upvotes: 0
Views: 84
Reputation: 7073
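You can use this simple regex. Note that it removes each tag without leaving anything in its place, so the text on either side of a tag is joined directly.
using System.Text.RegularExpressions;

// Removes every <...> span (non-greedy), replacing it with an empty string.
private string StripTagsRegex(string source)
{
    return Regex.Replace(source, "<.*?>", string.Empty);
}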
For more complex work, use Html Agility Pack, a library commonly recommended for this. It tolerates malformed HTML and parses it into a traversable DOM, similar to the System.Xml classes.
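As a rough illustration of the Html Agility Pack route, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (HapTextExtractor, ExtractText) are made up for this example. It collects the text nodes of the parsed document and joins them with single spaces.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

// Hypothetical helper, not part of the library.
public static class HapTextExtractor
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed markup

        // Keep only text nodes, decode entities such as &amp;,
        // and drop pieces that are empty after trimming.
        var pieces = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        return string.Join(" ", pieces);
    }
}
```

Calling HapTextExtractor.ExtractText on the string from the question should give the text pieces separated by spaces. Text inside script or style elements would also be picked up, so filter those out if the page contains them.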
Upvotes: 3
Reputation: 6738
It seems like what you're asking for is to replace each run of contiguous HTML tags with a single space, whereas codebased's answer would just concatenate the text on either side of a tag.
The following strips actual tags and HTML comments while preserving everything else (including < and > characters that don't form part of a tag).
// Replaces each run of adjacent tags and/or comments with one space, then trims the ends.
private string StripTagsRegex(string source)
{
    return Regex.Replace(source, "(</?[a-z][a-z0-9]*[^<>]*>|<!--.*?-->)+", " ",
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline).Trim();
}
Using this method:
<a>SomeTextHere</a>MoreText<br><tr>SomeText</tr>
becomes
SomeTextHere MoreText SomeText
which is what I think you were really asking for.
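To make the difference between the two regexes concrete, here is a small console sketch. The class and method names (TagStripDemo, StripTagsSimple, StripTagsSpaced) are invented for the comparison; the patterns and options are the ones from the two answers, and the sample string is written with the capitalisation of the expected output.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Pattern from the first answer: each <...> span is replaced with nothing.
    public static string StripTagsSimple(string source) =>
        Regex.Replace(source, "<.*?>", string.Empty);

    // Pattern from this answer: each run of tags/comments becomes one space.
    public static string StripTagsSpaced(string source) =>
        Regex.Replace(source, "(</?[a-z][a-z0-9]*[^<>]*>|<!--.*?-->)+", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline).Trim();

    public static void Main()
    {
        const string html = "<a>SomeTextHere</a>MoreText<br><tr>SomeText</tr>";

        Console.WriteLine(StripTagsSimple(html)); // SomeTextHereMoreTextSomeText
        Console.WriteLine(StripTagsSpaced(html)); // SomeTextHere MoreText SomeText

        // Angle brackets that are not part of a tag:
        Console.WriteLine(StripTagsSimple("1 < 2 and 3 > 2")); // "1  2" (the span between < and > is lost)
        Console.WriteLine(StripTagsSpaced("1 < 2 and 3 > 2")); // 1 < 2 and 3 > 2
    }
}
```

The second pattern only matches a < that is immediately followed by a letter, a slash plus a letter, or !--, which is why stray comparison operators survive.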
Upvotes: 2