Reputation: 9970

How do I remove all HTML tags from a string without knowing which tags are in it?

Is there any easy way to remove all HTML tags or ANYTHING HTML related from a string?

For example:

string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)"

The above should really be:

"Hulk Hogan's Celebrity Championship Wrestling [Proj # 206010] (Reality Series)"

Upvotes: 208

Answers (7)

Jaykumar Chaniyara

Reputation: 11

// Compile the pattern once and reuse it across calls.
static readonly Regex htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

public static string RemoveHTMLTagsCompiled(string html)
{
    return htmlRegex.Replace(html, string.Empty);
}

Upvotes: 0

Chaan Rahman

Reputation: 1

public static string StripHtml(string input)
{
    // Strip the tags first, then decode entities such as &amp; and &nbsp;.
    return string.IsNullOrEmpty(input)
        ? input
        : System.Web.HttpUtility.HtmlDecode(System.Text.RegularExpressions.Regex.Replace(input, "<.*?>", String.Empty));
}

Upvotes: 0

Khanbala Rashidov

Reputation: 102

public static string StripHTML(string input)
{
    if (input == null)
    {
        return string.Empty;
    }
    return Regex.Replace(input, "<.*?>", String.Empty);
}

Upvotes: 0

Jeff Qi

Reputation: 41

I built a small function to remove HTML tags.

// Requires: using System.Collections.Generic; using System.Linq;
//           using System.Text; using System.Text.RegularExpressions;
public static string RemoveHtmlTags(string text)
{
    // Positions of every '<' and '>' in the input.
    List<int> openTagIndexes = Regex.Matches(text, "<").Cast<Match>().Select(m => m.Index).ToList();
    List<int> closeTagIndexes = Regex.Matches(text, ">").Cast<Match>().Select(m => m.Index).ToList();
    if (closeTagIndexes.Count > 0)
    {
        StringBuilder sb = new StringBuilder();
        int previousIndex = 0;
        foreach (int closeTagIndex in closeTagIndexes)
        {
            var openTagsSubset = openTagIndexes.Where(x => x >= previousIndex && x < closeTagIndex);
            if (openTagsSubset.Count() > 0 && closeTagIndex - openTagsSubset.Max() > 1)
            {
                // A non-empty tag ends here: keep only the text before its '<'.
                sb.Append(text.Substring(previousIndex, openTagsSubset.Max() - previousIndex));
            }
            else
            {
                // No matching '<' (or an empty "<>"): keep the text, including this '>'.
                sb.Append(text.Substring(previousIndex, closeTagIndex - previousIndex + 1));
            }
            previousIndex = closeTagIndex + 1;
        }
        // Append whatever follows the last '>'.
        if (closeTagIndexes.Max() < text.Length)
        {
            sb.Append(text.Substring(closeTagIndexes.Max() + 1));
        }
        return sb.ToString();
    }
    else
    {
        return text;
    }
}

Upvotes: 1

nicolas

Reputation: 7668

You can use a simple regex like this:

public static string StripHTML(string input)
{
   return Regex.Replace(input, "<.*?>", String.Empty);
}

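Be aware that this solution has flaws, because a regex cannot parse HTML reliably. For example, it stops at a > inside an attribute value and leaves entities such as &nbsp; undecoded. See Remove HTML tags in String for more information (especially the comments by 'Mark E. Haase'/@mehaase).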

Another solution would be to use the HTML Agility Pack.
You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?
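
To make the caveat concrete, here is a minimal sketch that runs the same pattern on three awkward inputs. The class name RegexStripDemo and the sample strings are made up for illustration; only the pattern comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the pattern "<.*?>" comes from the answer.
public static class RegexStripDemo
{
    public static string StripHTML(string input)
    {
        return Regex.Replace(input, "<.*?>", String.Empty);
    }

    public static void Main()
    {
        // A '>' inside an attribute value ends the lazy match too early,
        // so part of the tag leaks into the output.
        Console.WriteLine(StripHTML("<a title=\"a > b\">link</a>"));

        // '.' does not match '\n' by default, so a tag that is broken
        // across lines is not removed at all.
        Console.WriteLine(StripHTML("<b\nclass=\"x\">bold</b>"));

        // Entities are left exactly as written.
        Console.WriteLine(StripHTML("<i>Tom &amp; Jerry</i>"));
    }
}
```

The first call leaves ` b">link` behind, the second keeps the opening tag, and the third still contains `&amp;`. If inputs like these can occur, a real parser such as the HTML Agility Pack is probably the safer choice.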

Upvotes: 407

Vinay

Reputation: 735

You can run the code below on your string to get the text without the HTML parts.

string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)".Replace("&nbsp;", string.Empty);
string s = Regex.Replace(title, "<.*?>", String.Empty);
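
The same idea can be generalised a little so that it is not tied to &nbsp; alone. The sketch below is only one possible variant, not the answer's own code: it strips tags with the same regex (plus RegexOptions.Singleline, my addition, so tags may span lines), decodes all entities with System.Net.WebUtility.HtmlDecode, and collapses the leftover whitespace. The names HtmlTextSketch and ToPlainText are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; a sketch, not a full HTML sanitizer.
public static class HtmlTextSketch
{
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // 1. Remove anything that looks like a tag (same regex caveats apply).
        string noTags = Regex.Replace(html, "<.*?>", string.Empty, RegexOptions.Singleline);

        // 2. Decode entities: &amp; becomes '&', &nbsp; becomes U+00A0, and so on.
        string decoded = WebUtility.HtmlDecode(noTags);

        // 3. Turn non-breaking spaces into ordinary spaces, then collapse
        //    whitespace runs and trim both ends.
        string spaced = decoded.Replace('\u00A0', ' ');
        return Regex.Replace(spaced, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)";
        Console.WriteLine(ToPlainText(title));
    }
}
```

For the title from the question this prints "Hulk Hogan's Celebrity Championship Wrestling [Proj # 206010] (Reality Series, )". The dangling ", " before the closing parenthesis is part of the data rather than HTML, so removing it would need a separate, application-specific step.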

Upvotes: 7

ssilas777

Reputation: 9764

You can parse the string using the HTML Agility Pack and read the InnerText of the document node.

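    HtmlDocument htmlDoc = new HtmlDocument();
    // Regular (non-verbatim) string literal: \" is only a valid escape without the @ prefix.
    htmlDoc.LoadHtml("<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)");
    // InnerText keeps entities such as &nbsp; as written; HtmlEntity.DeEntitize(result) decodes them.
    string result = htmlDoc.DocumentNode.InnerText;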

Upvotes: 87
