Jeff Putz
Jeff Putz

Reputation: 14907

Hanging .NET regex with high CPU

Here's a weird .NET regex problem that I can't figure out. I'm trying to unparse some HTML in my forum app. I haven't changed the code, but in certain environments, the regex simply never returns. I can reproduce it in the app:

line 66: https://github.com/POPWorldMedia/POPForums/blob/master/PopForums/Services/TextParsingService.cs

text = Regex.Replace(text, @"(<iframe )(\S+ )*(src=""http://www.youtube.com/embed/)(\S+)("")( *\S+)*( */iframe>)", "http://www.youtube.com/watch?v=$4", RegexOptions.IgnoreCase);

The input string it's choking on is:

<p>This is an <strong>important</strong> <em>preview</em> of a post.</p>[quote]<p>This is a quote.<br /></p>[/quote]<p><iframe width="640" height="360" src="http://www.youtube.com/embed/Zey3WWThErw" frameborder="0" allowfullscreen></iframe></p><p>O look! YouTube!</p>

It will eventually time out here: http://regexlib.com/RETester.aspx

The host process, IIS in this case, goes to about 50% locally (one core, I assume) and never lets go or returns. I'm completely stumped. The same code is running on one of my sites in Azure and it doesn't choke there.

Upvotes: 1

Views: 149

Answers (2)

user557597
user557597

Reputation:

Your only problem is this ( \ * \S+ )*.

Engines get particularly annoyed mixing (zero/many* with many+)* inside a zero/many group.
Singularize the many in that case and problem is solved. ie: ( _* _+)* to => ( _* _)*
These are the only places where it causes problems, especially when the many could match a lot
of different characters.

Always look to check that first, and don't get paranoid about backtracking.

 # @"(<iframe\ )(\S+\ )*(src=""http://www\.youtube\.com/embed/)(\S+)("")(\ *\S)*(\ */iframe>)"

 ( <iframe\  )                            # (1)
 ( \S+ \  )*                              # (2)
 (                                        # (3 start)
      src="http://www \. youtube \. com/embed/ 
 )                                        # (3 end)
 ( \S+ )                                  # (4)
 ( " )                                    # (5)
 ( \ * \S )*                              # (6)
 ( \ */iframe> )                          # (7)

Output:

 **  Grp 0 -  ( pos 119 , len 121 ) 
<iframe width="640" height="360" src="http://www.youtube.com/embed/Zey3WWThErw" frameborder="0" allowfullscreen></iframe>  
 **  Grp 1 -  ( pos 119 , len 8 ) 
<iframe   
 **  Grp 2 -  ( pos 139 , len 13 ) 
height="360"   
 **  Grp 3 -  ( pos 152 , len 34 ) 
src="http://www.youtube.com/embed/  
 **  Grp 4 -  ( pos 186 , len 11 ) 
Zey3WWThErw  
 **  Grp 5 -  ( pos 197 , len 1 ) 
"  
 **  Grp 6 -  ( pos 231 , len 1 ) 
<  
 **  Grp 7 -  ( pos 232 , len 8 ) 
/iframe>  

Upvotes: 1

dee-see
dee-see

Reputation: 24078

The (\S+ )* and ( *\S+)* parts cause a lot of backtracking.

Consider replacing them simply by .*. It is not 100% equivalent, but I think it should work with what I feel like you are trying to do.

text = Regex.Replace(text, @"(<iframe )(.)*(src=""http://www.youtube.com/embed/)(\S+)("")(.)*( */iframe>)", "http://www.youtube.com/watch?v=$4", RegexOptions.IgnoreCase);

You'll have other problems with that regex since it maches greedily. You might want to try this instead to make sure you don't have any problems if there are several iframe tags within your text.

text = Regex.Replace(text, @"(<iframe )(.)*?(src=""http://www.youtube.com/embed/)(\S+)("")(.)*?( */iframe>)", "http://www.youtube.com/watch?v=$4", RegexOptions.IgnoreCase);

As always, you should also consider using an HTML parser instead of regex for this kind of task.

Upvotes: 3

Related Questions