Reputation: 46480
I have an asp.net web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database.
On the server, I would like to take strip the html from the text so I can store only the text in a Full Text indexed column for searching.
It's a breeze to strip the html on the client using jQuery's text() function, but I would really rather do this on the server. Are there any existing utilities that I can use for this?
See my answer.
alt text http://tinyurl.com/sillychimp
Upvotes: 11
Views: 9245
Reputation: 61
Check the following example:
TextReader tr = new StreamReader(@"Filepath");
string str = tr.ReadToEnd();
str= Regex.Replace(str,"<(.|\n)*?>", string.Empty);
but you need to have a namespace referenced i.e:
System.Text.RegularExpressions
only take this logic for your website
Upvotes: 4
Reputation: 1613
Take a look at this Strip HTML tags from a string using regular expressions
Upvotes: 8
Reputation: 566
You can use something like this
string strwithouthtmltag;
strwithouthtmltag = Regex.Replace(strWithHTMLTags, "<[^>]*>", string.Empty)
Upvotes: 0
Reputation: 378
You can use HTQL COM, and query the source with a query: <body> &tx;
Upvotes: 0
Reputation: 46480
I downloaded the HtmlAgilityPack and created this function:
string StripHtml(string html)
{
// create whitespace between html elements, so that words do not run together
html = html.Replace(">","> ");
// parse html
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// strip html decoded text from html
string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
// replace all whitespace with a single space and remove leading and trailing whitespace
return Regex.Replace(text, @"\s+", " ").Trim();
}
Upvotes: 13
Reputation: 31593
As you may have malformed HTML in the system: BeautifulSoup or similar could be used.
It is written in Python; I am not sure how it could be interfaced - using the .NET language IronPython?
Upvotes: 0
Reputation: 9771
Here's Jeff Atwood's RefactorMe code link for his Sanitize HTML method
Upvotes: 2
Reputation: 35374
You could:
Upvotes: 0
Reputation: 25349
If you are just storing text for indexing then you probably want to do a bit more than just remove the HTML, such as ignoring stop-words and removing words shorter than (say) 3 characters. However, a simple tag and stripper I once wrote goes something like this:
public static string StripTags(string value)
{
if (value == null)
return string.Empty;
string pattern = @"&.{1,8};";
value = Regex.Replace(value, pattern, " ");
pattern = @"<(.|\n)*?>";
return Regex.Replace(value, pattern, string.Empty);
}
It's old and I'm sure it can be optimised (perhaps using a compiled reg-ex?). But it does work and may help...
Upvotes: 0