Reputation: 52523
So I just got my site kicked off the server today and I think this function is the culprit. Can anyone tell me what the problem is? I can't seem to figure it out:
Public Function CleanText(ByVal str As String) As String
'removes HTML tags and other characters that title tags and descriptions don't like
If Not String.IsNullOrEmpty(str) Then
'mini db of extended tags to get rid of
Dim indexChars() As String = {"<a", "<img", "<input type=""hidden"" name=""tax""", "<input type=""hidden"" name=""handling""", "<span", "<p", "<ul", "<div", "<embed", "<object", "<param"}
For i As Integer = 0 To indexChars.GetUpperBound(0) 'loop through indexchars array
Dim indexOfInput As Integer = 0
Do 'get rid of links
indexOfInput = str.IndexOf(indexChars(i)) 'find instance of indexChar
If indexOfInput <> -1 Then
Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput) + 1
Dim indexRightBracket As Integer = str.IndexOf(">", indexOfInput) + 1
'check to make sure a right bracket hasn't been left off a tag
If indexNextLeftBracket > indexRightBracket Then 'normal case
str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
Else
'add the right bracket right before the next left bracket, just remove everything
'in the bad tag
str = str.Insert(indexNextLeftBracket - 1, ">")
indexRightBracket = str.IndexOf(">", indexOfInput) + 1
str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
End If
End If
Loop Until indexOfInput = -1
Next
End If
Return str
End Function
Upvotes: 0
Views: 328
Reputation: 300559
Wouldn't something like this be simpler? (OK, I know it's not identical to posted code):
public string StripHTMLTags(string text)
{
return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}
(Conversion to VB.NET should be trivial!)
Note: if you are running this often, there are two performance improvements you can make to the Regex
.
One is to use a pre-compiled expression which requires re-writing slightly.
The second is to use a non-capturing form of the regular expression; .NET regular expressions implement the (?:) syntax, which allows for grouping to be done without incurring the performance penalty of captured text being remembered as a backreference. Using this syntax, the above regular expression could be changed to:
@"<(?:.|\n)*?>"
Upvotes: 5
Reputation: 415810
This line is also wrong:
Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput) + 1
It's guaranteed to always set indexNextLeftBracket equal to indexOfInput, because at this point the character at the position referred to by indexOfInput is already always a '<'. Do this instead:
Dim indexNextLeftBracket As Integer = str.IndexOf("<", indexOfInput+1) + 1
And also add a clause to the if statement to make sure your string is long enough for that expression.
Finally, as others have said this code will be a beast to maintain, if you can get it working at all. Best to look for another solution, like a regex or even just replacing all '<' with <
.
Upvotes: 3
Reputation: 45117
In addition to other good answers, you might read up a little on loop invariants a little bit. The pulling out and putting back stuff to the string you check to terminate your loop should set off all manner of alarm bells. :)
Upvotes: 1
Reputation: 85655
That doesn't seem to work for a simplistic <a<a<a
case, or even <a>Test</a>
. Did you test this at all?
Personally, I hate string parsing like this - so I'm not going to even try figuring out where your error is. It'd require a debugger, and more headache than I'm willing to put in.
Upvotes: 0
Reputation: 32367
I'd have to run it through a real compiler but the mindpiler tells me that the str = str.Remove(indexOfInput, indexRightBracket - indexOfInput)
line is re-generating an invalid tag such that when you loop through again it finds the same mistake "fixes" it, tries again, finds the mistake "fixes" it, etc.
FWIW heres a snippet of code that removes unwanted HTML tags from a string (It's in C# but the concept translates)
public static string RemoveTags( string html, params string[] allowList )
{
if( html == null ) return null;
Regex regex = new Regex( @"(?<Tag><(?<TagName>[a-z/]+)\S*?[^<]*?>)",
RegexOptions.Compiled |
RegexOptions.IgnoreCase |
RegexOptions.Multiline );
return regex.Replace(
html,
new MatchEvaluator(
new TagMatchEvaluator( allowList ).Replace ) );
}
MatchEvaluator class
private class TagMatchEvaluator
{
private readonly ArrayList _allowed = null;
public TagMatchEvaluator( string[] allowList )
{
_allowed = new ArrayList( allowList );
}
public string Replace( Match match )
{
if( _allowed.Contains( match.Groups[ "TagName" ].Value ) )
return match.Value;
return "";
}
}
Upvotes: 0
Reputation: 444
What happens if your code tries to clean the string <a
?
As I read it, it finds the indexChar at position 0, but then indexNextLeftBracket and indexRightBracket both equal 0, you fall into the else condition, and then you insert a ">" at position -1, which will presumably insert at the beginning, giving you the string ><a
. The new indexRightBracket then becomes 0, so you delete from position 0 for 0 characters, leaving you with ><a
. Then the code finds the <a
in the code again, and you're off to the races with an infinite memory-consuming loop.
Even if I'm wrong, you need to get yourself some unit tests to reassure yourself that these edge cases work properly. That should also help you find the actual looping code if I'm off-base.
Generally speaking though, even if you fix this particular bug, it's never going to be very robust. Parsing HTML is hard, and HTML blacklists are always going to have holes. For instance, if I really want to get a <input type="hidden" name="tax"
tag in, I'll just write it as <input name="tax" type="hidden"
and your code will ignore it. Your better bet is to get an actual HTML parser involved, and to only allow the (very small) subset of tags that you actually want. Or even better, use some other form of markup, and strip all HTML tags (again using a real HTML parser of some description).
Upvotes: 0
Reputation: 5187
Just a guess, but is this like the culprit? indexOfInput = str.IndexOf(indexChars(i)) 'find instance of indexChar
Per the Microsoft docs, Return Value - The index position of value if that string is found, or -1 if it is not. If value is Empty, the return value is 0.
So perhaps indexOfInput is being set to 0?
Upvotes: 0