Reputation: 9970
Is there any easy way to remove all HTML tags or ANYTHING HTML related from a string?
For example:
string title = "<b> Hulk Hogan's Celebrity Championship Wrestling <font color=\"#228b22\">[Proj # 206010]</font></b> (Reality Series, )"
The above should really be:
"Hulk Hogan's Celebrity Championship Wrestling [Proj # 206010] (Reality Series)"
Upvotes: 208
Views: 313897
Reputation: 11
static Regex htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);
public static string RemoveHTMLTagsCompiled(string html)
{
return htmlRegex.Replace(html, string.Empty);
}
Upvotes: 0
Reputation: 1
public static string StripHtml(string input)
{
return string.IsNullOrEmpty(input) ? input : System.Web.HttpUtility.HtmlDecode(System.Text.RegularExpressions.Regex.Replace(input, "<.*?>", String.Empty));
}
Upvotes: 0
Reputation: 102
public static string StripHTML(string input)
{
if (input==null)
{
return string.Empty;
}
return Regex.Replace(input, "<.*?>", String.Empty);
}
Upvotes: 0
Reputation: 41
I built a small function to remove HTML tags.
public static string RemoveHtmlTags(string text)
{
List<int> openTagIndexes = Regex.Matches(text, "<").Cast<Match>().Select(m => m.Index).ToList();
List<int> closeTagIndexes = Regex.Matches(text, ">").Cast<Match>().Select(m => m.Index).ToList();
if (closeTagIndexes.Count > 0)
{
StringBuilder sb = new StringBuilder();
int previousIndex = 0;
foreach (int closeTagIndex in closeTagIndexes)
{
var openTagsSubset = openTagIndexes.Where(x => x >= previousIndex && x < closeTagIndex);
if (openTagsSubset.Count() > 0 && closeTagIndex - openTagsSubset.Max() > 1 )
{
sb.Append(text.Substring(previousIndex, openTagsSubset.Max() - previousIndex));
}
else
{
sb.Append(text.Substring(previousIndex, closeTagIndex - previousIndex + 1));
}
previousIndex = closeTagIndex + 1;
}
if (closeTagIndexes.Max() < text.Length)
{
sb.Append(text.Substring(closeTagIndexes.Max() + 1));
}
return sb.ToString();
}
else
{
return text;
}
}
Upvotes: 1
Reputation: 7668
You can use a simple regex like this:
public static string StripHTML(string input)
{
return Regex.Replace(input, "<.*?>", String.Empty);
}
Be aware that this solution has its own flaw. See Remove HTML tags in String for more information (especially the comments of 'Mark E. Haase'/@mehaase)
Another solution would be to use the HTML Agility Pack.
You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?
Upvotes: 407
Reputation: 735
You can use the below code on your string and you will get the complete string without html part.
string title = "<b> Hulk Hogan's Celebrity Championship Wrestling <font color=\"#228b22\">[Proj # 206010]</font></b> (Reality Series, )".Replace(" ",string.Empty);
string s = Regex.Replace(title, "<.*?>", String.Empty);
Upvotes: 7
Reputation: 9764
You can parse the string using Html Agility pack and get the InnerText.
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(@"<b> Hulk Hogan's Celebrity Championship Wrestling <font color=\"#228b22\">[Proj # 206010]</font></b> (Reality Series, )");
string result = htmlDoc.DocumentNode.InnerText;
Upvotes: 87