Tom

Reputation: 6707

Strip away all HTML tags and formatting (RegEx)

I have an RSS feed that I want to modify on the fly. All I need is the text (and line feeds), so everything else must be removed (all images, styles, and links).

How can I do this easily with ASP.NET and C#?

Upvotes: 2

Views: 1756

Answers (4)

bobince

Reputation: 536339

Regex cannot parse XML. Do not use regex to parse XML. Do not pass Go. Do not collect £200.

You need a proper XML parser. Load the RSS into an XmlDocument, then use the InnerText property to get only the text content.
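As a rough sketch of that approach, the snippet below loads a made-up RSS 2.0 document into an XmlDocument and reads InnerText from each item description. The class name RssTextExtractor, the helper GetDescriptions, and the sample feed are assumptions for illustration, and the XPath assumes the usual rss/channel/item layout.

```csharp
using System;
using System.Xml;

public static class RssTextExtractor
{
    // Hypothetical helper: returns the text content of every <description>
    // found under rss/channel/item.
    public static string[] GetDescriptions(string rssXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(rssXml);

        XmlNodeList nodes = doc.SelectNodes("/rss/channel/item/description");
        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
        {
            // InnerText concatenates the text of the node and its children,
            // with XML entity references already decoded.
            result[i] = nodes[i].InnerText.Trim();
        }
        return result;
    }

    public static void Main()
    {
        // Made-up sample feed, for illustration only.
        string rss =
            "<rss version=\"2.0\"><channel><title>Sample</title>" +
            "<item><title>Lunch</title>" +
            "<description> Fish &amp; chips </description>" +
            "</item></channel></rss>";

        foreach (string text in GetDescriptions(rss))
        {
            Console.WriteLine(text); // prints: Fish & chips
        }
    }
}
```

In an ASP.NET page the feed would more likely come from doc.Load with a URL or stream rather than from a string literal.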

Note that even when you've extracted the description content from RSS, it can contain active HTML. That is:

<description> &lt;em&gt;Fish&lt;/em&gt; &amp;amp; chips </description>

can, when parsed properly as XML and then read as text, give you either the literal string:

<em>Fish</em> &amp; chips

or, the markup:

Fish & chips (with "Fish" rendered in emphasis)

The fun thing about RSS is that you don't really know which is right. In RSS 2.0 it is explicitly HTML markup (the second case); in other versions it's not specified. Generally you should assume that descriptions can contain entity-encoded HTML tags, and if you want to further strip those from the final text you'll need a second parsing step.

(Unfortunately, since this is legacy HTML and not XML it's harder to parse; a regex will be even more useless than it is for parsing XML. There isn't a built-in HTML parser in .NET, but there are third-party libraries such as the HTML Agility Pack.)
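The second parsing step could look something like the sketch below, which assumes the third-party HtmlAgilityPack NuGet package is installed. The class name HtmlTextExtractor and the helper ToPlainText are made up for illustration; HtmlDocument, LoadHtml, DocumentNode.InnerText, and HtmlEntity.DeEntitize come from the library.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET

public static class HtmlTextExtractor
{
    // Hypothetical helper: takes the HTML string recovered from an RSS
    // description and returns its text without tags.
    public static string ToPlainText(string htmlFragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlFragment);

        // InnerText drops the tags; DeEntitize then decodes HTML
        // entities such as &amp; that remain in the text.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        // The literal string from the example above, after XML parsing.
        Console.WriteLine(ToPlainText("<em>Fish</em> &amp; chips"));
    }
}
```

This only strips markup. Turning elements such as br or p into line feeds would need extra handling that is not shown here.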

Upvotes: 5

Paul Herzberg

Reputation: 119

I did this in JavaScript for a project in much the same way as above:

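// Read the raw markup from the input element
var thisText = document.getElementById('textToStrip').value;
// Match "<", then any characters (including newlines) non-greedily, then ">"
var re = new RegExp('<(.|\\n)*?>', 'igm');
thisText = thisText.replace(re, '');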

Upvotes: 0

Tom

Reputation: 6707

// Body of a method that takes htmlString; requires System.Text.RegularExpressions
string pattern = @"<(.|\n)*?>";
return Regex.Replace(htmlString, pattern, string.Empty);

Upvotes: 0

teedyay

Reputation: 23511

Be careful: you don't want to assume that the HTML you receive is well-formed:

public static string ClearHTMLTagsFromString(string htmlString)
{
    string regEx = @"\<[^\<\>]*\>";
    string tagless = Regex.Replace(htmlString, regEx, string.Empty);

    // remove rogue leftovers
    tagless = tagless.Replace("<", string.Empty).Replace(">", string.Empty);

    return tagless;
}

Upvotes: 0
