Reputation: 14907
Here's a weird .NET regex problem that I can't figure out. I'm trying to unparse some HTML in my forum app. I haven't changed the code, but in certain environments, the regex simply never returns. I can reproduce it in the app:
line 66: https://github.com/POPWorldMedia/POPForums/blob/master/PopForums/Services/TextParsingService.cs
text = Regex.Replace(text, @"(<iframe )(\S+ )*(src=""http://www.youtube.com/embed/)(\S+)("")( *\S+)*( */iframe>)", "http://www.youtube.com/watch?v=$4", RegexOptions.IgnoreCase);
The input string it's choking on is:
<p>This is an <strong>important</strong> <em>preview</em> of a post.</p>[quote]<p>This is a quote.<br /></p>[/quote]<p><iframe width="640" height="360" src="http://www.youtube.com/embed/Zey3WWThErw" frameborder="0" allowfullscreen></iframe></p><p>O look! YouTube!</p>
It will eventually time out here: http://regexlib.com/RETester.aspx
The host process, IIS in this case, goes to about 50% locally (one core, I assume) and never lets go or returns. I'm completely stumped. The same code is running on one of my sites in Azure and it doesn't choke there.
Upvotes: 1
Views: 149
Reputation:
Your only problem is this ( \ * \S+ )*
.
Engines get particularly annoyed mixing (zero/many*
with many+
)* inside a zero/many group
.
Singularize the many in that case and problem is solved. ie: ( _* _+)*
to => ( _* _)*
These are the only places where it causes problems, especially when the many could match a lot
of different characters.
Always look to check that first, and don't get paranoid about backtracking.
# @"(<iframe\ )(\S+\ )*(src=""http://www\.youtube\.com/embed/)(\S+)("")(\ *\S)*(\ */iframe>)"
( <iframe\ ) # (1)
( \S+ \ )* # (2)
( # (3 start)
src="http://www \. youtube \. com/embed/
) # (3 end)
( \S+ ) # (4)
( " ) # (5)
( \ * \S )* # (6)
( \ */iframe> ) # (7)
Output:
** Grp 0 - ( pos 119 , len 121 )
<iframe width="640" height="360" src="http://www.youtube.com/embed/Zey3WWThErw" frameborder="0" allowfullscreen></iframe>
** Grp 1 - ( pos 119 , len 8 )
<iframe
** Grp 2 - ( pos 139 , len 13 )
height="360"
** Grp 3 - ( pos 152 , len 34 )
src="http://www.youtube.com/embed/
** Grp 4 - ( pos 186 , len 11 )
Zey3WWThErw
** Grp 5 - ( pos 197 , len 1 )
"
** Grp 6 - ( pos 231 , len 1 )
<
** Grp 7 - ( pos 232 , len 8 )
/iframe>
Upvotes: 1
Reputation: 24078
The (\S+ )*
and ( *\S+)*
parts cause a lot of backtracking.
Consider replacing them simply by .*
. It is not 100% equivalent, but I think it should work with what I feel like you are trying to do.
text = Regex.Replace(text, @"(<iframe )(.)*(src=""http://www.youtube.com/embed/)(\S+)("")(.)*( */iframe>)", "http://www.youtube.com/watch?v=$4", RegexOptions.IgnoreCase);
You'll have other problems with that regex since it maches greedily. You might want to try this instead to make sure you don't have any problems if there are several iframe
tags within your text.
text = Regex.Replace(text, @"(<iframe )(.)*?(src=""http://www.youtube.com/embed/)(\S+)("")(.)*?( */iframe>)", "http://www.youtube.com/watch?v=$4", RegexOptions.IgnoreCase);
As always, you should also consider using an HTML parser instead of regex for this kind of task.
Upvotes: 3