the.slow.one
the.slow.one

Reputation: 410

Python regular expression for hangs for a specific input

A regular expression is formed for URL as follows.

(?i)\b((?:[a-z][\w-]+://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:''.,<>?]))

I'm using the same for last 6 months in my project and it worked well. But today, for an of the following input, it hangs the code.

'$.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"'

To find out the reason, I separated each character, appended it with part of input succeeding the character and applied the regular expression on the formed substring. And I got the following output

Scanned text: $
Time taken: 9.05990600586e-06

Scanned text: $.
Time taken: 5.96046447754e-06

Scanned text: $.g
Time taken: 4.05311584473e-06

Scanned text: $.ge
Time taken: 4.05311584473e-06

Scanned text: $.get
Time taken: 3.81469726562e-06

Scanned text: $.getS
Time taken: 5.00679016113e-06

Scanned text: $.getSc
Time taken: 5.00679016113e-06

Scanned text: $.getScr
Time taken: 5.96046447754e-06

Scanned text: $.getScri
Time taken: 1.00135803223e-05

Scanned text: $.getScrip
Time taken: 1.09672546387e-05

Scanned text: $.getScript
Time taken: 1.21593475342e-05

Scanned text: $.getScript(
Time taken: 1.4066696167e-05

Scanned text: $.getScript("
Time taken: 1.50203704834e-05

Scanned text: $.getScript("h
Time taken: 1.59740447998e-05

Scanned text: $.getScript("ht
Time taken: 1.59740447998e-05

Scanned text: $.getScript("htt
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http:
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http:/
Time taken: 1.90734863281e-05

Scanned text: $.getScript("http://
Time taken: 1.97887420654e-05

Scanned text: $.getScript("http://w
Time taken: 2.21729278564e-05

Scanned text: $.getScript("http://ww
Time taken: 2.31266021729e-05

Scanned text: $.getScript("http://www
Time taken: 2.09808349609e-05

Scanned text: $.getScript("http://www.
Time taken: 2.28881835938e-05

Scanned text: $.getScript("http://www.g
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.go
Time taken: 2.09808349609e-05

Scanned text: $.getScript("http://www.goo
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.goog
Time taken: 2.00271606445e-05

Scanned text: $.getScript("http://www.googl
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.google
Time taken: 2.09808349609e-05

Scanned text: $.getScript("http://www.google.
Time taken: 2.28881835938e-05

Scanned text: $.getScript("http://www.google.c
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.google.co
Time taken: 2.21729278564e-05

Scanned text: $.getScript("http://www.google.com
Time taken: 2.90870666504e-05

Scanned text: $.getScript("http://www.google.com/
Time taken: 1.8835067749e-05

Scanned text: $.getScript("http://www.google.com/r
Time taken: 2.21729278564e-05

Scanned text: $.getScript("http://www.google.com/re
Time taken: 2.28881835938e-05

Scanned text: $.getScript("http://www.google.com/rec
Time taken: 2.31266021729e-05

Scanned text: $.getScript("http://www.google.com/reca
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.google.com/recap
Time taken: 2.121925354e-05

Scanned text: $.getScript("http://www.google.com/recapt
Time taken: 2.28881835938e-05

Scanned text: $.getScript("http://www.google.com/recaptc
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.google.com/recaptch
Time taken: 2.40802764893e-05

Scanned text: $.getScript("http://www.google.com/recaptcha
Time taken: 1.71661376953e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/
Time taken: 1.31130218506e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/a
Time taken: 1.28746032715e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/ap
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api
Time taken: 1.31130218506e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/
Time taken: 1.28746032715e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/j
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js
Time taken: 2.19345092773e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/
Time taken: 2.28881835938e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/r
Time taken: 2.40802764893e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/re
Time taken: 2.09808349609e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/rec
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/reca
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recap
Time taken: 1.38282775879e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recapt
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptc
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptch
Time taken: 1.38282775879e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_a
Time taken: 1.38282775879e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_aj
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_aja
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax
Time taken: 1.38282775879e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.j
Time taken: 1.38282775879e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js"
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",f
Time taken: 1.4066696167e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",fu
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",fun
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",func
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",funct
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",functi
Time taken: 1.90734863281e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",functio
Time taken: 1.47819519043e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function
Time taken: 1.50203704834e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function()
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){R
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Re
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Rec
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Reca
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recap
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recapt
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptc
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptch
Time taken: 1.71661376953e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.c
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.cr
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.cre
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.crea
Time taken: 1.59740447998e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.creat
Time taken: 1.71661376953e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create
Time taken: 1.69277191162e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(
Time taken: 1.81198120117e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(S
Time taken: 2.00271606445e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(St
Time taken: 2.21729278564e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Sta
Time taken: 2.69412994385e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Stac
Time taken: 3.50475311279e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(Stack
Time taken: 5.19752502441e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackE
Time taken: 8.48770141602e-05

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackEx
Time taken: 0.000190019607544

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExc
Time taken: 0.00028395652771

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExch
Time taken: 0.000540971755981

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExcha
Time taken: 0.00104284286499

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchan
Time taken: 0.00213289260864

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchang
Time taken: 0.00400710105896

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange
Time taken: 0.00786900520325

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.
Time taken: 0.0186932086945

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.o
Time taken: 0.0350210666656

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.op
Time taken: 0.0638809204102

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.opt
Time taken: 0.126477003098

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.opti
Time taken: 0.257136106491

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.optio
Time taken: 0.5178399086

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.option
Time taken: 1.06005501747

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options
Time taken: 2.07899594307

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.
Time taken: 4.15297198296

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.s
Time taken: 8.27355504036

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.si
Time taken: 16.3851959705

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.sit
Time taken: 32.8552730083

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site
Time taken: 66.1672520638

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.
Time taken: 132.832139015

Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.r
Time taken: 266.925510168

Sent a keyboard interrupt, to kill the process, as it was taking a lot of time.

It can be observed that after appending each character time needed to run re.findall() is doubled. It' will take a lot of time to run the regex on whole line. So actually it is not in a hang stated but it is taking very very long time.

I'm not sure why re.findall is taking that much time. One way to avoid this could be, to restrict the outcome of re.findall() to run for a specific time duration. I'm guessing there could be a simpler way to avoid this. If so kindly suggest.

Incase needed the code is as following.

import re
import time

url = re.compile('(?i)((?:[a-z][\w-]+://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:''.,<>?]))')
buggy = '$.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"'

buggy_part = ''
for each_char in buggy:
    buggy_part += each_char
    start = time.time()
    re.findall(url, buggy_part)
    done = time.time()
    elapsed = done - start
    print 'Scanned text: ' + buggy_part
    print 'Time taken: ' + str(elapsed) + '\n'

Upvotes: 1

Views: 161

Answers (1)

Martijn Pieters
Martijn Pieters

Reputation: 1121834

I find your expression neigh unreadable, but it is clear that you are suffering from catastrophic backtracking; exponential performance degradation due to greedy patterns.

And indeed, replacing your + quantifiers with +? non-greedy quantifiers makes your test script run near instantly:

>>> url = re.compile('(?i)((?:[a-z][\w-]+?://(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+?[.][a-z]{2,4}/)(?:[^\s()<>]+?|\(([^\s()<>]+?|(\([^\s()<>]+?\)))*\))+?(?:\(([^\s()<>]+?|(\([^\s()<>]+?\)))*\)|[^\s`!()\[\]{};:''.,<>?]))')
[.. your script ..]
[('http://www', '', '', '', ''), ('.google.com/re', '', '', '', '')]
Scanned text: $.getScript("http://www.google.com/recaptcha/api/js/recaptcha_ajax.js",function(){Recaptcha.create(StackExchange.options.site.recaptchaPublicKey,"recaptcha",{theme:"clean",lang:StackExchange.options.site.recaptchaAudioLang,custom_translations:{visual_challenge:"Get a visual challenge",audio_challenge:"Get an audio challenge",refresh_btn:"Get a new challenge",instructions_visual:"Type the text:",instructions_audio:"Type what you hear:",help_btn:"Help",play_again:"Play sound again",cant_hear_this:"
Time taken: 0.000473022460938

This may not be the right match; perhaps one of the patterns is now not greedy enough. Try to limit the pattern to just one or two greedy matches, look for more specific patterns and anchors to limit the range of a greedy search. Most likely you don't need quite as many greedy quantifiers to get the job done.

Upvotes: 1

Related Questions