Ronnie Overby

Reputation: 46480

How can I strip HTML from text in .NET?

I have an ASP.NET web page that has a TinyMCE box. Users can format text and send the HTML to be stored in a database.

On the server, I would like to strip the HTML from the text so that I can store only the text in a full-text indexed column for searching.

It's a breeze to strip the HTML on the client using jQuery's text() function, but I would really rather do this on the server. Are there any existing utilities that I can use for this?

EDIT

See my answer.

EDIT 2

Image: http://tinyurl.com/sillychimp

Upvotes: 11

Views: 9245

Answers (9)

Muhammad Hamayoon

Reputation: 61

Check the following example:

// Read the whole file, then remove anything that looks like a tag.
string str = File.ReadAllText(@"Filepath");
str = Regex.Replace(str, "<(.|\n)*?>", string.Empty);

This requires the following namespaces:

System.IO
System.Text.RegularExpressions

You can adapt this logic for your website.

Upvotes: 4

riotera

Reputation: 1613

Take a look at this: Strip HTML tags from a string using regular expressions

Upvotes: 8

Nirlep

Reputation: 566

You can use something like this:

// strWithHTMLTags holds the original markup
string strwithouthtmltag = Regex.Replace(strWithHTMLTags, "<[^>]*>", string.Empty);

Upvotes: 0

seagulf

Reputation: 378

You can use HTQL COM and query the source with the query: <body> &tx;

Upvotes: 0

Ronnie Overby

Reputation: 46480

I downloaded the HtmlAgilityPack and created this function:

string StripHtml(string html)
{
    // create whitespace between html elements, so that words do not run together
    html = html.Replace(">","> ");

    // parse html
    var doc = new HtmlAgilityPack.HtmlDocument();   
    doc.LoadHtml(html);

    // extract the text content and decode HTML entities
    string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);   

    // replace all whitespace with a single space and remove leading and trailing whitespace
    return Regex.Replace(text, @"\s+", " ").Trim();
}

Upvotes: 13

Peter Mortensen

Reputation: 31593

Since you may have malformed HTML in the system, BeautifulSoup or something similar could be used.

It is written in Python; I am not sure how it could be interfaced from .NET, perhaps through IronPython?

Upvotes: 0

Tristan Warner-Smith

Reputation: 9771

Here's a link to Jeff Atwood's RefactorMe code for his Sanitize HTML method.

Upvotes: 2

richardtallent

Reputation: 35374

You could:

  • Use a plain old TEXTAREA (styled for height/width/font/etc.) rather than TinyMCE.
  • Use TinyMCE's built-in configuration options for stripping unwanted HTML.
  • Use HttpUtility.HtmlDecode(Regex.Replace(mystring, "<[^>]+>", "")) on the server.
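
The server-side option in the last bullet could be fleshed out roughly as follows. This is only a sketch under some assumptions: the class and method names (HtmlText, Strip) are made up for illustration, WebUtility.HtmlDecode from System.Net stands in for HttpUtility.HtmlDecode so that no System.Web reference is needed, and each tag is replaced with a space rather than an empty string so that words in adjacent elements do not run together. A regex like this does not handle a ">" inside an attribute value, and it leaves the contents of script and style elements in place.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from any library.
public static class HtmlText
{
    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Replace each tag with a space so adjacent words stay separated.
        string withoutTags = Regex.Replace(html, "<[^>]+>", " ");

        // Decode entities such as &amp; once the tags are gone.
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // Collapse runs of whitespace and trim both ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For example, HtmlText.Strip("<p>Hello <b>world</b> &amp; more</p>") returns "Hello world & more".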

Upvotes: 0

Dan Diplo

Reputation: 25349

If you are just storing text for indexing then you probably want to do a bit more than just remove the HTML, such as ignoring stop-words and removing words shorter than (say) 3 characters. However, a simple tag and entity stripper I once wrote goes something like this:

    public static string StripTags(string value)
    {
        if (value == null)
            return string.Empty;

        string pattern = @"&.{1,8};";
        value = Regex.Replace(value, pattern, " ");
        pattern = @"<(.|\n)*?>";
        return Regex.Replace(value, pattern, string.Empty);
    }

It's old and I'm sure it can be optimised (perhaps using a compiled reg-ex?). But it does work and may help...
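
One way the compiled-regex idea might look is sketched below. The class name TagStripper and the two field names are assumptions for illustration; the patterns are the same as in the method above. Whether RegexOptions.Compiled actually helps depends on how often the method is called, since compilation adds a one-time startup cost, so it would be worth measuring before adopting it.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper class; the names are illustrative only.
public static class TagStripper
{
    // Built once and reused across calls instead of being re-parsed each time.
    private static readonly Regex EntityPattern =
        new Regex(@"&.{1,8};", RegexOptions.Compiled);
    private static readonly Regex TagPattern =
        new Regex(@"<(.|\n)*?>", RegexOptions.Compiled);

    public static string StripTags(string value)
    {
        if (value == null)
            return string.Empty;

        // Replace entities with a space first, then drop the tags.
        value = EntityPattern.Replace(value, " ");
        return TagPattern.Replace(value, string.Empty);
    }
}
```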

Upvotes: 0
