Reputation: 410
A regular expression is formed for URL as follows.
(?i)\b((?:[a-z][\w-]+://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:''.,<>?]))
I'm using the same for last 6 months in my project and it worked well. But today, for an of the following input, it hangs the code.
'$.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"'
To find out the reason, I separated each character, appended it with part of input succeeding the character and applied the regular expression on the formed substring. And I got the following output
Scanned text: $
Time taken: 9.05990600586e-06
Scanned text: $.
Time taken: 5.96046447754e-06
Scanned text: $.g
Time taken: 4.05311584473e-06
Scanned text: $.ge
Time taken: 4.05311584473e-06
Scanned text: $.get
Time taken: 3.81469726562e-06
Scanned text: $.getS
Time taken: 5.00679016113e-06
Scanned text: $.getSc
Time taken: 5.00679016113e-06
Scanned text: $.getScr
Time taken: 5.96046447754e-06
Scanned text: $.getScri
Time taken: 1.00135803223e-05
Scanned text: $.getScrip
Time taken: 1.09672546387e-05
Scanned text: $.getScript
Time taken: 1.21593475342e-05
Scanned text: $.getScript(
Time taken: 1.4066696167e-05
Scanned text: $.getScript("
Time taken: 1.50203704834e-05
Scanned text: $.getScript("h
Time taken: 1.59740447998e-05
Scanned text: $.getScript("ht
Time taken: 1.59740447998e-05
Scanned text: $.getScript("htt
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http:
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http:/
Time taken: 1.90734863281e-05
Scanned text: $.getScript("http://
Time taken: 1.97887420654e-05
Scanned text: $.getScript("http://w
Time taken: 2.21729278564e-05
Scanned text: $.getScript("http://ww
Time taken: 2.31266021729e-05
Scanned text: $.getScript("http://www
Time taken: 2.09808349609e-05
Scanned text: $.getScript("http://www.
Time taken: 2.28881835938e-05
Scanned text: $.getScript("http://www.g
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.go
Time taken: 2.09808349609e-05
Scanned text: $.getScript("http://www.goo
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.goog
Time taken: 2.00271606445e-05
Scanned text: $.getScript("http://www.googl
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.google
Time taken: 2.09808349609e-05
Scanned text: $.getScript("http://www.google.
Time taken: 2.28881835938e-05
Scanned text: $.getScript("http://www.google.c
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.google.co
Time taken: 2.21729278564e-05
Scanned text: $.getScript("http://www.google.com
Time taken: 2.90870666504e-05
Scanned text: $.getScript("http://www.google.com/
Time taken: 1.8835067749e-05
Scanned text: $.getScript("http://www.google.com/r
Time taken: 2.21729278564e-05
Scanned text: $.getScript("http://www.google.com/re
Time taken: 2.28881835938e-05
Scanned text: $.getScript("http://www.google.com/rec
Time taken: 2.31266021729e-05
Scanned text: $.getScript("http://www.google.com/reca
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.google.com/recap
Time taken: 2.121925354e-05
Scanned text: $.getScript("http://www.google.com/recapt
Time taken: 2.28881835938e-05
Scanned text: $.getScript("http://www.google.com/recaptc
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.google.com/recaptch
Time taken: 2.40802764893e-05
Scanned text: $.getScript("http://www.google.com/recaptcha
Time taken: 1.71661376953e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/
Time taken: 1.31130218506e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/a
Time taken: 1.28746032715e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/ap
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api
Time taken: 1.31130218506e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/
Time taken: 1.28746032715e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/j
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js
Time taken: 2.19345092773e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/
Time taken: 2.28881835938e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/r
Time taken: 2.40802764893e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/re
Time taken: 2.09808349609e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/rec
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/reca
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recap
Time taken: 1.38282775879e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recapt
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptc
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptch
Time taken: 1.38282775879e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_a
Time taken: 1.38282775879e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_aj
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_aja
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax
Time taken: 1.38282775879e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.j
Time taken: 1.38282775879e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js"
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",f
Time taken: 1.4066696167e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",fu
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",fun
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",func
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",funct
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",functi
Time taken: 1.90734863281e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",functio
Time taken: 1.47819519043e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function
Time taken: 1.50203704834e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function()
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){R
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Re
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Rec
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Reca
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recap
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recapt
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptc
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptch
Time taken: 1.71661376953e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.c
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.cr
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.cre
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.crea
Time taken: 1.59740447998e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.creat
Time taken: 1.71661376953e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create
Time taken: 1.69277191162e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(
Time taken: 1.81198120117e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(S
Time taken: 2.00271606445e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(St
Time taken: 2.21729278564e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Sta
Time taken: 2.69412994385e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Stac
Time taken: 3.50475311279e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Stack
Time taken: 5.19752502441e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackE
Time taken: 8.48770141602e-05
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackEx
Time taken: 0.000190019607544
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExc
Time taken: 0.00028395652771
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExch
Time taken: 0.000540971755981
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExcha
Time taken: 0.00104284286499
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchan
Time taken: 0.00213289260864
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchang
Time taken: 0.00400710105896
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange
Time taken: 0.00786900520325
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.
Time taken: 0.0186932086945
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.o
Time taken: 0.0350210666656
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.op
Time taken: 0.0638809204102
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.opt
Time taken: 0.126477003098
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.opti
Time taken: 0.257136106491
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.optio
Time taken: 0.5178399086
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.option
Time taken: 1.06005501747
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options
Time taken: 2.07899594307
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.
Time taken: 4.15297198296
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.s
Time taken: 8.27355504036
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.si
Time taken: 16.3851959705
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.sit
Time taken: 32.8552730083
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site
Time taken: 66.1672520638
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.
Time taken: 132.832139015
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.r
Time taken: 266.925510168
Sent a keyboard interrupt, to kill the process, as it was taking a lot of time.
It can be observed that after appending each character time needed to run re.findall() is doubled. It' will take a lot of time to run the regex on whole line. So actually it is not in a hang stated but it is taking very very long time.
I'm not sure why re.findall is taking that much time. One way to avoid this could be, to restrict the outcome of re.findall() to run for a specific time duration. I'm guessing there could be a simpler way to avoid this. If so kindly suggest.
Incase needed the code is as following.
import re
import time
url = re.compile('(?i)((?:[a-z][\w-]+://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:''.,<>?]))')
buggy = '$.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"'
buggy_part = ''
for each_char in buggy:
buggy_part += each_char
start = time.time()
re.findall(url, buggy_part)
done = time.time()
elapsed = done - start
print 'Scanned text: ' + buggy_part
print 'Time taken: ' + str(elapsed) + '\n'
Upvotes: 1
Views: 161
Reputation: 1121834
I find your expression neigh unreadable, but it is clear that you are suffering from catastrophic backtracking; exponential performance degradation due to greedy patterns.
And indeed, replacing your +
quantifiers with +?
non-greedy quantifiers makes your test script run near instantly:
>>> url = re.compile('(?i)((?:[a-z][\w-]+?://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+?[.][a-z]{2,4}/)(?:[^\s()<>]+?|\(([^\s()<>]+?|(\([^\s()<>]+?\)))*\))+?(?:\(([^\s()<>]+?|(\([^\s()<>]+?\)))*\)|[^\s`!()\[\]{};:''.,<>?]))')
[.. your script ..]
[('http://www', '', '', '', ''), ('.google.com/re', '', '', '', '')]
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"
Time taken: 0.000473022460938
This may not be the right match; perhaps one of the patterns is now not greedy enough. Try to limit the pattern to just one or two greedy matches, look for more specific patterns and anchors to limit the range of a greedy search. Most likely you don't need quite as many greedy quantifiers to get the job done.
Upvotes: 1