Reputation: 6707
I have an RSS feed that I want to modify on the fly. All I need is the text (and line feeds), so everything else must be removed (all images, styles, and links).
How can I do this easily with ASP.NET C#?
Upvotes: 2
Views: 1756
Reputation: 536339
Regex cannot parse XML. Do not use regex to parse XML. Do not pass Go. Do not collect £200.
You need a proper XML parser. Load the RSS into an XmlDocument, then read the InnerText property of the relevant nodes to get only their text content.
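As a rough illustration of that approach, here is a minimal sketch using System.Xml. The helper name ReadDescriptions and the sample feed are made up for the example, and it assumes an RSS 2.0 feed, whose elements are not in an XML namespace (RSS 1.0 and Atom would need an XmlNamespaceManager for the XPath query).

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class RssText
{
    // Hypothetical helper: returns the text of every <description> under an <item>.
    public static List<string> ReadDescriptions(string rssXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(rssXml);

        var result = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//item/description"))
        {
            // InnerText concatenates the text nodes and decodes one level of XML entities.
            result.Add(node.InnerText);
        }
        return result;
    }
}

static class Program
{
    static void Main()
    {
        // Made-up sample feed; a real one would come from a file or an HTTP response.
        string rss =
            "<rss version=\"2.0\"><channel><item>" +
            "<title>Lunch</title>" +
            "<description>&lt;em&gt;Fish&lt;/em&gt; &amp;amp; chips</description>" +
            "</item></channel></rss>";

        foreach (string text in RssText.ReadDescriptions(rss))
        {
            Console.WriteLine(text); // prints: <em>Fish</em> &amp; chips
        }
    }
}
```

Note that the result still contains HTML tags and an encoded ampersand, because the feed carried entity-encoded HTML inside the XML.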
Note that even when you've extracted the description content from RSS, it can contain active HTML. That is:
<description> &lt;em&gt;Fish&lt;/em&gt; &amp;amp; chips </description>
can, when parsed properly as XML and then read as text, give you either the literal string:
<em>Fish</em> &amp; chips
or the markup it represents once rendered as HTML:
Fish & chips (with "Fish" emphasised)
The fun thing about RSS is that you don't really know which is right. In RSS 2.0 it is explicitly HTML markup (the second case); in other versions it's not specified. Generally you should assume that descriptions can contain entity-encoded HTML tags, and if you want to further strip those from the final text you'll need a second parsing step.
(Unfortunately, since this is legacy HTML and not XML it's harder to parse; a regex will be even more useless than it is for parsing XML. There isn't a built-in HTML parser in .NET, but there are third-party libraries such as the HTML Agility Pack.)
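A second parsing step along those lines might look like the sketch below. It assumes the HTML Agility Pack NuGet package is referenced; the class and method names (DescriptionText, ToPlainText) are hypothetical and chosen only for this example.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET

static class DescriptionText
{
    // Hypothetical helper: takes the string read from <description>
    // and returns it with HTML tags removed and entities decoded.
    public static string ToPlainText(string descriptionHtml)
    {
        var html = new HtmlDocument();
        html.LoadHtml(descriptionHtml); // tolerant of malformed, real-world HTML

        // InnerText drops the tags; DeEntitize then decodes entities such as &amp;.
        return HtmlEntity.DeEntitize(html.DocumentNode.InnerText);
    }
}
```

Called on the string obtained from the XML parser, this should yield plain text suitable for display, though whitespace and line breaks implied by block-level tags are not reconstructed.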
Upvotes: 5
Reputation: 119
I did this in JavaScript for a project in much the same way as above:
var thisText = '';
thisText = document.getElementById('textToStrip').value;
var re = new RegExp('<(.|\\n)*?>', 'igm');
thisText = thisText.replace(re, '');
Upvotes: 0
Reputation: 6707
string pattern = @"<(.|\n)*?>";
return Regex.Replace(htmlString, pattern, string.Empty);
Upvotes: 0
Reputation: 23511
Be careful: you can't assume that the HTML you receive is well-formed:
// requires: using System.Text.RegularExpressions;
public static string ClearHTMLTagsFromString(string htmlString)
{
    // remove anything that looks like a complete tag
    string regEx = @"\<[^\<\>]*\>";
    string tagless = Regex.Replace(htmlString, regEx, string.Empty);
    // remove rogue leftovers (unmatched angle brackets)
    tagless = tagless.Replace("<", string.Empty).Replace(">", string.Empty);
    return tagless;
}
Upvotes: 0