TheUnrealMegashark

Reputation: 309

Remove everything in a string between two characters

I'm extracting some data from a webpage, and I want to remove everything between the characters < and >, along with those characters themselves. Here is an example of a string I'm getting from a site:

<a>SomeTextHere</a>Moretext<br><tr>SomeText</tr>

I want my final result to be:

SomeTextHere Moretext SomeText

Is there a way I can do this quickly and efficiently?

Upvotes: 0

Views: 84

Answers (2)

codebased

Reputation: 7073

You can use this simple RegEx.

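// Requires: using System.Text.RegularExpressions;
private string StripTagsRegex(string source)
{
    // Lazily match from each '<' to the nearest '>' and delete the match.
    return Regex.Replace(source, "<.*?>", string.Empty);
}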
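
As a quick check of what this pattern does, here is a minimal, self-contained sketch that runs it on the sample string from the question. The class name `TagStripDemo` and the method name `StripTags` are made up for illustration. Note that the text on either side of each tag is joined with no separator, and that `.` does not match a newline by default, so a tag that spans several lines would be left in place.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only Regex.Replace comes from the answer above.
public static class TagStripDemo
{
    public static string StripTags(string source)
    {
        // Delete every shortest run that starts with '<' and ends with '>'.
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    public static void Main()
    {
        string sample = "<a>SomeTextHere</a>Moretext<br><tr>SomeText</tr>";
        Console.WriteLine(StripTags(sample)); // prints "SomeTextHereMoretextSomeText"
    }
}
```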

For more complex work, use Html Agility Pack, a library commonly recommended for this. It tolerates malformed HTML and gives you a traversable DOM, much like the XML classes.
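
A rough sketch of the Html Agility Pack route, assuming the `HtmlAgilityPack` NuGet package is installed. The class and method names (`HapTextDemo`, `ExtractText`) are hypothetical, and joining the text nodes with a space is a choice made here to match the output the question asks for, not something the library does by itself. Text inside `<script>` or `<style>` elements would be included too, so filter those out if your pages contain them.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is referenced

// Hypothetical demo class, not part of the library.
public static class HapTextDemo
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
        {
            return string.Empty;
        }

        // Decode entities such as &amp;, drop whitespace-only nodes, join with spaces.
        var pieces = textNodes
            .Select(node => HtmlEntity.DeEntitize(node.InnerText).Trim())
            .Where(text => text.Length > 0);

        return string.Join(" ", pieces);
    }

    public static void Main()
    {
        Console.WriteLine(ExtractText("<a>SomeTextHere</a>Moretext<br><tr>SomeText</tr>"));
    }
}
```

Compared with a regex, this costs a dependency and a full parse, but it copes with attributes that contain `>` and with entity decoding.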

Upvotes: 3

Iain Fraser

Reputation: 6738

It seems like what you're asking for is to replace each run of contiguous HTML tags with a single space, whereas codebased's answer simply concatenates the text on either side of each tag.

The following strips actual tags and HTML comments while preserving everything else (including < and > characters that don't form part of a tag).

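private string StripTagsRegex(string source)
{
    // Each run of adjacent tags or comments becomes one space; Trim() drops the outer ones.
    const string pattern = "(</?[a-z][a-z0-9]*[^<>]*>|<!--.*?-->)+";
    var options = RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline;
    return Regex.Replace(source, pattern, " ", options).Trim();
}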

Using this method:

<a>SomeTextHere</a>Moretext<br><tr>SomeText</tr>

becomes

SomeTextHere Moretext SomeText

which is what I think you were really asking for.
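
To see how the pattern treats stray `<` and `>` characters and HTML comments, here is a small self-contained sketch. The class name `SpaceStripDemo`, the method name `StripTags`, and the second input string are made up for illustration. The regex never changes letter case, so `Moretext` comes out exactly as it went in.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class wrapping the pattern from this answer.
public static class SpaceStripDemo
{
    public static string StripTags(string source)
    {
        // A tag must start with '<' (optionally '</') followed by a letter,
        // so "1 < 2" is left alone. Singleline lets a comment span lines.
        const string pattern = "(</?[a-z][a-z0-9]*[^<>]*>|<!--.*?-->)+";
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        return Regex.Replace(source, pattern, " ", options).Trim();
    }

    public static void Main()
    {
        // Adjacent tags (<br><tr>) collapse into a single space.
        Console.WriteLine(StripTags("<a>SomeTextHere</a>Moretext<br><tr>SomeText</tr>"));
        // prints "SomeTextHere Moretext SomeText"

        // The bare '<' survives; the tag and the comment become one space.
        Console.WriteLine(StripTags("1 < 2<br><!-- note -->ok"));
        // prints "1 < 2 ok"
    }
}
```

`RegexOptions.Multiline` is left out of this sketch because it only affects the `^` and `$` anchors, which the pattern does not use.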

Upvotes: 2
