Reputation: 8180
I have found that certain inputs cause the iteration over the matches to have bad performance characteristics (non linear). Has anyone come across this? Would you consider this to be a bug?
At the moment I can only reproduce the problem with a single particular set of inputs, I am hoping to generalize this and understand what is going on.
The following program demonstrates the problem and plots a performance graph showing the nonlinear characteristic for these specially chosen inputs.
output is follows:
6500 chars: 2750 ms: ||
7000 chars: 4327 ms: ||||
7500 chars: 5918 ms: |||||
8000 chars: 8017 ms: ||||||||
8500 chars: 12263 ms: ||||||||||||
9000 chars: 20525 ms: ||||||||||||||||||||
9500 chars: 32392 ms: ||||||||||||||||||||||||||||||||
10000 chars: 43512 ms: |||||||||||||||||||||||||||||||||||||||||||
10500 chars: 55601 ms: |||||||||||||||||||||||||||||||||||||||||||||||||||||||
...
13000 chars: 183131 ms ...
So a doubling of input length has caused a 66.59 fold increase in processing time.
c# code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Diagnostics;
namespace RegExTest
{
class Program
{
static string html;
static void Main(string[] args)
{
PrepareString();
Console.WriteLine("text length: " + html.Length);
string bad = @"<a.*?href=[""'](?<url>.*?)[""'].*?>(?<name>.*?)</a>";
TestRegex(bad);
Console.WriteLine("done! " + bad);
Console.Read();
}
private static void TestRegex(string regex)
{
int i = 0;
int len = 6500;
Stopwatch sw = new Stopwatch();
while (i < html.Length - len)
{
var text = html.Substring(i, len);
sw.Reset();
sw.Start();
Console.Write(len.ToString().PadLeft(8) + " chars: ");
var matches = Regex.Matches(text, regex, RegexOptions.IgnoreCase);
int c = 0;
foreach (var match in matches)
{
c++;
}
var secs = sw.ElapsedMilliseconds / 1000;
Console.Write(sw.ElapsedMilliseconds.ToString().PadLeft(8) + " ms: ");
for (var ii = 0; ii < secs; ++ii)
{
Console.Write("|");
}
if (secs > 0) Console.WriteLine("");
//i += len;
len += 500;
}
}
private static void PrepareString()
{
html = @"</entry><entry><id>tag:blogger.com,1999:blog-2521995507770407606.post-4673830730577756707</id><published>2011-05-10T11:39:00.001-03:00</published><updated>2011-05-10T12:12:50.022-03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='music maker'/><title type='text'>Music Maker - O Sucesso Da Semana (09.05.2011)</title><content type='html'><div class=""separator"" style=""clear: both; text-align: center;""><a href=""http://2.bp.blogspot.com/-TASUdjWaPTk/TclDogdGBoI/AAAAAAAAAMA/-BCEmy3eylo/s1600/Adriana_Calcanhotto_2203%2521.jpg"" imageanchor=""1"" style=""margin-left: 1em; margin-right: 1em;""><img border=""0"" height=""284"" src=""http://2.bp.blogspot.com/-TASUdjWaPTk/TclDogdGBoI/AAAAAAAAAMA/-BCEmy3eylo/s320/Adriana_Calcanhotto_2203%2521.jpg"" width=""320"" />&nbsp;</a></div><div class=""separator"" style=""clear: both; text-align: justify;"">Adriana Calcanhotto ?? uma artista de mil faces. Isso todos concordamos. Depois de voar na charmosa MPB, viajar no mundo l??dico de Partimpim (amada por crian??as e aprovadas pelos adultos), Adriana volta com um projeto que segundo ela se resumira somente ao CD. Nada de shows. Tristeza para os f??s, alegria para a discografia brasileira. Nesse ??lbum ???O Micr??bio do Samba??? e consagrada a grande nomes do samba de que est??o no DNA de sua musicalidade: Ismael Silva, Noel Rosa e Cartola. Da?? o t??tulo do disco que rima com alegria que rima com melancolia. Prazer com dor. E dessa obra prima selecionamos a faixa ???Ja Reparo???, voc?? que pode conferir durante toda semana no Music Maker - A Musica Da Semana. </div><div class=""blogger-post-footer""><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2521995507770407606-4673830730577756707?l=radiosantaclarafm.blogspot.com' alt='' /></div></content><link rel='replies' type='application/atom+xml' href='http://radiosantaclarafm.blogspot.com/feeds/4673830730577756707/comments/default' title='Postar coment??rios'/><link rel='replies' type='text/html' href='http://radiosantaclarafm.blogspot.com/2011/05/music-maker-o-sucesso-da-semana_10.html#comment-form' title='0 Coment??rios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2521995507770407606/posts/default/4673830730577756707'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2521995507770407606/posts/default/4673830730577756707'/><link rel='alternate' type='text/html' href='http://radiosantaclarafm.blogspot.com/2011/05/music-maker-o-sucesso-da-semana_10.html' title='Music Maker - O Sucesso Da Semana (09.05.2011)'/><author><name>awmix</name><uri>http://www.blogger.com/profile/05913607318738669543</uri><email>[email protected]</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='04237720592496611900'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-TASUdjWaPTk/TclDogdGBoI/AAAAAAAAAMA/-BCEmy3eylo/s72-c/Adriana_Calcanhotto_2203%2521.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2521995507770407606.post-4389486062742970158</id><published>2011-05-04T02:03:00.000-03:00</published><updated>2011-05-04T02:03:15.026-03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='music maker'/><title type='text'>Music Maker - O Sucesso Da Semana (02.05.2011)</title><content type='html'><div class=""separator"" style=""clear: both; text-align: center;""><a href=""http://3.bp.blogspot.com/-4AWBq1nzFMM/TcDZlw8DsfI/AAAAAAAAAL8/9VJF4C4lxLY/s1600/Tie+-+Dois.jpg"" imageanchor=""1"" style=""margin-left: 1em; margin-right: 1em;""><img border=""0"" height=""213"" src=""http://3.bp.blogspot.com/-4AWBq1nzFMM/TcDZlw8DsfI/AAAAAAAAAL8/9VJF4C4lxLY/s320/Tie+-+Dois.jpg"" width=""320"" /></a></div><div style=""text-align: justify;"">Ti?? Gasparinetti ou somente Ti??, paulistana, mostra em sua musica uma MPB suave que faz viajar em melodias, poemas e notas que atravessam o tempo. Do seu primeiro album (Sweet Jardim), de 2009 selecionamos a faixa ''Dois'', que voc?? vai conferir durante toda essa semana no Music Maker. </div><div class=""blogger-post-footer""><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2521995507770407606-4389486062742970158?l=radiosantaclarafm.blogspot.com' alt='' /></div></content><link rel='replies' type='application/atom+xml' href='http://radiosantaclarafm.blogspot.com/feeds/4389486062742970158/comments/default' title='Postar coment??rios'/><link rel='replies' type='text/html' href='http://radiosantaclarafm.blogspot.com/2011/05/music-maker-o-sucesso-da-semana.html#comment-form' title='0 Coment??rios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2521995507770407606/posts/default/4389486062742970158'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2521995507770407606/posts/default/4389486062742970158'/><link rel='alternate' type='text/html' href='http://radiosantaclarafm.blogspot.com/2011/05/music-maker-o-sucesso-da-semana.html' title='Music Maker - O Sucesso Da Semana (02.05.2011)'/><author><name>awmix</name><uri>http://www.blogger.com/profile/05913607318738669543</uri><email>[email protected]</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='04237720592496611900'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-4AWBq1nzFMM/TcDZlw8DsfI/AAAAAAAAAL8/9VJF4C4lxLY/s72-c/Tie+-+Dois.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2521995507770407606.post-3621064109448287254</id><published>2011-04-25T01:27:00.003-03:00</published><updated>2011-04-25T09:24:59.022-03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='music maker'/><title type='text'>Music Maker - O Sucesso Da Semana (25.04.2011)</title><content type='html'><div class=""separator"" style=""clear: both; text-align: center;""><a href=""http://2.bp.blogspot.com/-uKv2Yyv6YhI/TbTzATluCfI/AAAAAAAAAL4/AZoocREWYLs/s1600/Alain_Clark.jpg"" imageanchor=""1"" style=""margin-left: 1em; margin-right: 1em;""><img border=""0"" height=""320"" src=""http://2.bp.blogspot.com/-uKv2Yyv6YhI/TbTzATluCfI/AAAAAAAAAL4/AZoocREWYLs/s320/Alain_Clark.jpg"" width=""318"" /></a></div><br /><div style=""text-align: justify;"">Sabe aquelas musicas que voc?? escuta e derrepente se pergunta - <i>Gente, acho que conhe??o essa musica?! </i>Ao ouvir Alain Clark temos realmente essa sensa????o. Meio do nada somos induzidos a pensar assim porque a musica dele ?? assim. Meio Stevie Wonder, Marvin Gayer e Ray Cherles o garoto do Haarlem, tem tudo aquilo que esperamos de um grande artista. Dele selecionamos ''Good Days''. Uma bel??ssima faixa de seu album de 2010 ''Colorblind'', que voc?? pode conferir em nossa programa????o durante toda essa semana no Music Maker. Boa audi????o.</div><div class=""blogger-post-footer""><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2521995507770407606-3621064109448287254?l=radiosantaclarafm.blogspot.com' alt='' /></div></content><link rel='replies' type='application/atom+xml' href='http://radiosantaclarafm.blogspot.com/feeds/3621064109448287254/comments/default' title='Postar coment??rios'/><link rel='replies' type='text/html' href='http://radiosantaclarafm.blogspot.com/2011/04/music-maker-o-sucesso-da-semana_6634.html#comment-form' title='0 Coment??rios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2521995507770407606/posts/default/3621064109448287254'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2521995507770407606/posts/default/3621064109448287254'/><link rel='alternate' type='text/html' href='http://radiosantaclarafm.blogspot.com/2011/04/music-maker-o-sucesso-da-semana_6634.html' title='Music Maker - O Sucesso Da Semana (25.04.2011)'/><author><name>awmix</name><uri>http://www.blogger.com/profile/05913607318738669543</uri><email>[email protected]</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='04237720592496611900'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-uKv2Yyv6YhI/TbTzATluCfI/AAAAAAAAAL4/AZoocREWYLs/s72-c/Alain_Clark.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2521995507770407606.post-6702871300289547413</id><published>2011-04-25T01:03:00.000-03:00</published><updated>2011-03-28T01:43:38.512-03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='music maker'/><title type='text'>Music Maker - O Sucesso Da Semana (06.03.2011)</title><content type='html'><div class=""separator"" style=""clear: both; text-align: center;""><a href=""https://lh5.googleusercontent.com/-aVGFGcJKk30/TXO8uNyhwlI/AAAAAAAAAK4/XGWfuojIQIw/s1600/nana-caymmi.jpg"" imageanchor=""1"" style=""margin-left: 1em; margin-right: 1em;""><img border=""0"" src=""https://lh5.googleusercontent.com/-aVGFGcJKk30/TXO8uNyhwlI/AAAAAAAAAK4/XGWfuojIQIw/s1600/nana-caymmi.jpg"" /></a></div><div class=""MsoNormal"" style=""margin: 5pt 0pt; text-align: justify;""><span lang=""PT"" style=""font-family: Arial; font-size: 11pt;"">Assim como Dionne Warwick e para a America, Nana Caymmi, e para nos brasileiros. Com algumas diferen??as no comparativo nossa diva tem em sua carreira a modesta simplicidade&nbsp; herdada do pai Dorival. Seu novo disco ???Sem Poupar Cora????o???, para os apaixonados e amantes da boa mpb, um album classico que n??o pode faltar em seu acervo. Do novo disco selecionamos uma musica que ja e destaque no cenario nacional ''</span><span lang=""PT"" style=""font-family: Arial; font-size: 11pt;"">Sem Poupar Cora????o'' que da nome ao CD. Durante toda semana voc??&nbsp; vai&nbsp; ouvir esse clasico da musica conte</span><span lang=""PT"" style=""font-family: Arial; font-size: 11pt;"">poraneo no Music Marker. Se ligue em 106.3 e boa audi????o</span><span lang=""PT"" style=""font-family: Arial; font-size: 11pt;""></span><span lang=""PT"" style=""font-family: Arial; font-size: 11pt;""></span>.</div><div class=""blogger-post-footer""><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2521995507770407606-8869788482425850808?l=radiosantaclarafm.blogspot.com' alt='' /></div></content><link rel='replies' type='application/atom+xml' href='http://radiosantaclarafm.blogspot.com/feeds/8869788482425850808/comments/default' title='Postar coment??rios'/><link rel='replies' type='text/html' href='http://radiosantaclarafm.blogspot.com/2011/03/music-maker-o-sucesso-da-semana.html#comment-form' title='0 Coment??rios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2521995507770407606/posts/default/8869788482425850808'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2521995507770407606/posts/default/8869788482425850808'/><link rel='alternate' type='text/html' href='http://radiosantaclarafm.blogspot.com/2011/03/music-maker-o-sucesso-da-semana.html' title='Music Maker - O Sucesso Da Semana (06.03.2011)'/><author><name>awmix</name><uri>http://www.blogger.com/profile/05913607318738669543</uri><email>[email protected]</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='04237720592496611900'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='https://lh5.googleusercontent.com/-aVGFGcJKk30/TXO8uNyhwlI/AAAAAAAAAK4/XGWfuojIQIw/s72-c/nana-caymmi.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2521995507770407606.post-1676448556167478841</id><published>2011-03-02T22:47:00.011-03:00</published><updated>2011-04-25T23:15:03.222-03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='music maker'/><title type='text'>Music Maker - O Sucesso Da Semana (27.02.2011)</title><content type='html'><div class=""separator"" style=""clear: both; text-align: center;""><a href=""http://3.bp.blogspot.com/-T6QAgGvU3bw/TW7zqHSi8uI/AAAAAAAAAKw/UsLo1djW-S0/s1600/jauperi.JPG"" imageanchor=""1"" style=""clear: left; float: left; margin-bottom: 1em; margin-right: 1em;""></a></div><div class=""separator"" style=""clear: both; text-align: center;""></div><div class=""separator"" style=""clear: both; text-align: center;""></div><div class=""separator"" style=""clear: both; text-align: center;""></div><div class=""separator"" style=""clear: both; text-align: center;""></div><div style=""margin-left: 1em; margin-right: 1em;""><img border=""0"" height=""314"" src=""http://3.bp.blogspot.com/-T6QAgGvU3bw/TW7zqHSi8uI/AAAAAAAAAKw/UsLo1djW-S0/s320/jauperi.JPG"" width=""320"" /></div><br /><div style=""margin-left: 1em; margin-right: 1em;""></div><br /><div style=""text-align: justify;"">Esse rapaz sorridente na foto acima ?? o grande Jauperi. Uma das melhores coisas que a Bahia produziu nos ??ltimos tempos. O mo??o fez reviver um cl??sssico de Caetano Veloso de 1969, Alegria Alegria. O cara mandou t??o&nbsp;bem no remaker que n??o tinhamos porque n??o tocar em nossa programa????o e com&nbsp;devido destaque. No Music Maker - A musica da Semana. Ent??o se ligue em nossa programa????o e confira Jauperi em 106,3.</div><div class=""blogger-post-footer""><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2521995507770407606-1676448556167478841?l=radiosantaclarafm.blogspot.com' alt='' /></div></content><link rel='replies' type='application/atom+xml' href='http://radiosantaclarafm.blogspot.com/feeds/1676448556167478841/comments/default' title='Postar coment??rios'/><link rel='replies' type='text/html' href='http://radiosantaclarafm.blogspot.com/2011/03/music-marker-musica-da-semana.html#comment-form' title='0 Coment??rios'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2521995507770407606/posts/default/1676448556167478841'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2521995507770407606/posts/default/1676448556167478841'/><link rel='alternate' type='text/html' href='http://radiosantaclarafm.blogspot.com/2011/03/music-marker-musica-da-semana.html' title='Music Maker - O Sucesso Da Semana (27.02.2011)'/><author><name>awmix</name><uri>http://www.blogger.com/profile/05913607318738669543</uri><email>[email protected]</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='04237720592496611900'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-T6QAgGvU3bw/TW7zqHSi8uI/AAAAAAAAAKw/UsLo1djW-S0/s72-c/jauperi.JPG' height='72' width='72'/><thr:total>0</thr:total>";
}
}
}
Upvotes: 2
Views: 302
Reputation: 336198
<a.*?href=["'](?<url>.*?)["'].*?>(?<name>.*?)</a>
is a sure-fire recipe for catastrophic backtracking because the dot can match any character and you have no fewer than four instances of .*?
in your regex which (given a long enough string) allows for an exponentially increasing number of permutations. This will become a problem especially if a string should fail to match the regex.
Solution:
Example:
<a[^<>]*href=["'](?<url>[^"']*)["'][^<>]*>(?<name>.*?)</a>
should help performance a lot.
[^<>]*
matches any number of characters except angle brackets, thereby ensuring that you'll never cross into another tag.
[^"']*
does the same for quotes.
(Remember to double the "
s inside a C# string; I've shown the pure regexes here)
Upvotes: 5