Reputation:
I'm writing a program that is supposed to strip HTML tags out of a string. I've been trying to replace every substring that starts with "<" and ends with ">", but that has not worked so far. Here's what I've tried:
StrippedContent = Regex.Replace(StrippedContent, "\<.*\>", "")
That just returns what seems like a random part of the original string. I've also tried:
For Each StringMatch As Match In Regex.Matches(StrippedContent, "\<.*\>")
    StrippedContent = StrippedContent.Replace(StringMatch.Value, "")
Next
That did the same thing (it returns what seems like a random part of the original string). Is there a better way to do this? By better, I mean a way that works.
Upvotes: 7
Views: 20743
Reputation: 525
Here's a simple function using the regex pattern that Ro Yo Mi posted.
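Imports System.Runtime.CompilerServices
Imports System.Text.RegularExpressions
' Extension methods must be declared inside a Module.
<Extension()> Public Function RemoveHtmlTags(value As String) As String
    Return Regex.Replace(value, "<(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*>", "")
End Function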
Demonstration:
Dim html As String = "This <i>is</i> just a <b>demo</b>.".RemoveHtmlTags()
Console.WriteLine(html)
Upvotes: 1
Reputation: 15010
This expression matches each HTML tag, including tags whose quoted attribute values contain ">" characters:
Regex: <(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
Replace with: nothing
Sample Text
Note the difficult edge case in the onmouseover attribute, whose quoted value itself contains a ">" character.
these are <a onmouseover=' href="NotYourHref" ; if (6/a>3) { funRotator(href) } ; ' href=abc.aspx?filter=3&prefix=&num=11&suffix=>the droids</a> you are looking for.
Code
Imports System.Text.RegularExpressions
Module Module1
    Sub Main()
        Dim sourcestring As String = "replace with your source string"
        Dim replacementstring As String = ""
        Dim matchpattern As String = "<(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*>"
        Console.WriteLine(Regex.Replace(sourcestring, matchpattern, replacementstring, RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace Or RegexOptions.Multiline Or RegexOptions.Singleline))
    End Sub
End Module
String after replacement
these are the droids you are looking for.
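To see why the longer pattern is worth the extra complexity, here is a minimal sketch comparing it with a plain lazy pattern. The input string and the module and variable names are made up for illustration and are not from the original post; the sample is a shortened stand-in for the edge case above, with a ">" inside a quoted attribute value.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TagPatternComparison
    Sub Main()
        ' Hypothetical input: the title attribute contains a ">" inside quotes.
        Dim sample As String = "<a title='x > y' href=""p.aspx"">link</a> text"

        ' Lazy pattern: stops at the first ">" it sees, even inside a quoted value.
        Dim lazyPattern As String = "<.*?>"

        ' Attribute-aware pattern from this answer: quoted values are consumed whole.
        Dim attributeAwarePattern As String = "<(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*>"

        ' Leaves attribute debris behind:  y' href="p.aspx">link text
        Console.WriteLine(Regex.Replace(sample, lazyPattern, ""))

        ' Removes both tags cleanly: link text
        Console.WriteLine(Regex.Replace(sample, attributeAwarePattern, ""))
    End Sub
End Module
```

The lazy pattern ends its first match at the ">" inside the title value, so the rest of the opening tag survives as text. The attribute-aware pattern treats each quoted value as a single unit and only stops at the ">" that really closes the tag.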
Upvotes: 32
Reputation:
Well, this proves that you should always search Google for an answer. Here's a method I got from http://www.dotnetperls.com/remove-html-tags-vbnet
Imports System.Text.RegularExpressions
Module Module1
    Sub Main()
        Dim html As String = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>"
        Dim tagless As String = StripTags(html)
        Console.WriteLine(tagless)
    End Sub
    Function StripTags(ByVal html As String) As String
        Return Regex.Replace(html, "<.*?>", "")
    End Function
End Module
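The reason this works where the attempts in the question did not appears to be the "?" after ".*". A rough sketch of the difference, using a made-up input string and hypothetical module and variable names: a greedy ".*" runs from the first "<" to the last ">" on the line, while the lazy ".*?" stops at the nearest ">".

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVersusLazy
    Sub Main()
        ' Hypothetical input with two tags on one line.
        Dim sample As String = "a <b>c</b> d"

        ' Greedy: one match spanning "<b>c</b>", so the text between the tags is lost.
        Dim greedyResult As String = Regex.Replace(sample, "<.*>", "")

        ' Lazy: two matches, "<b>" and "</b>", so the text between them is kept.
        Dim lazyResult As String = Regex.Replace(sample, "<.*?>", "")

        Console.WriteLine("[" & greedyResult & "]") ' prints [a  d]
        Console.WriteLine("[" & lazyResult & "]")   ' prints [a c d]
    End Sub
End Module
```

With real HTML, where the first "<" and the last ">" are often far apart, the greedy version deletes almost everything in between, which would explain output that looks like a random part of the original string.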
Upvotes: 4