Reputation: 7599
I'm writing a little app for spam-checking and I'm having problems with a regex.
Let's say I have this spam URL:
http://hosting.tyumen.ru/tip.html
I want to check that the URL has two full stops in the host (subdomain + ending), then a slash, a word, a full stop and "html".
Here's what I've got so far:
<a href="(http://.*?\..*?..*?/.*?.html)">(http://.*?\..*?..*?/.*?.html)</a>
It might look like rubbish, but it works. The problem: it's really slow and freezes my app.
Any hints on how to optimize it? Thanks.
Upvotes: 0
Views: 221
Reputation: 123937
(http://[\w.-]+/.+?\.html)
This may only work for your particular case.
Or, possibly faster:
(http://[\w.-]+/[^.]+\.html)
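To see how the two patterns differ, here is a minimal Python sketch that runs both against the URL from the question. The second test URL (`http://example.com/a.b.html`) is made up for illustration; it shows that the `[^.]+` version rejects paths that contain a dot before the extension.

```python
import re

# Sample URL from the question.
url = "http://hosting.tyumen.ru/tip.html"

# First suggestion: lazy match for the path.
lazy = re.compile(r'(http://[\w.-]+/.+?\.html)')
# Second suggestion: the path may not contain a dot before ".html".
negated = re.compile(r'(http://[\w.-]+/[^.]+\.html)')

lazy_match = lazy.search(url)
negated_match = negated.search(url)
print(lazy_match.group(1))
print(negated_match.group(1))

# Hypothetical URL with an extra dot in the path:
# only the lazy pattern accepts it.
dotted = "http://example.com/a.b.html"
print(bool(lazy.search(dotted)), bool(negated.search(dotted)))
```

Which one is preferable depends on whether dotted file names should count as matches for your spam check.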
Upvotes: 2
Reputation: 801
In Python, a simple way to match URLs ending in .html or .htm is to use
import re

url_re = re.compile(
    r'https?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
    r'localhost|'  # ...or localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or IP address
    r'(?::\d+)?'  # optional port
    r'\S+\.html?'  # path ending in .html or .htm
    , re.IGNORECASE)
which is a modified version of Django's URLField regex.
This will match any URL whose path ends in .html or .htm, whether the host is localhost, an IP address, or a domain name.
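As a rough usage sketch, the pattern can be tried on a few sample URLs. Only the first URL comes from the question; the others are invented for illustration, and the pattern here escapes the dot before `html` so that it matches a literal full stop.

```python
import re

url_re = re.compile(
    r'https?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
    r'localhost|'  # ...or localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or IP address
    r'(?::\d+)?'  # optional port
    r'\S+\.html?',  # path ending in .html or .htm
    re.IGNORECASE)

samples = [
    "http://hosting.tyumen.ru/tip.html",  # from the question
    "https://localhost:8000/index.htm",   # hypothetical
    "http://192.168.0.1/a/b.html",        # hypothetical
    "http://example.com/page.php",        # hypothetical, should not match
]

# match() anchors at the start of the string only.
results = {s: bool(url_re.match(s)) for s in samples}
for sample, ok in results.items():
    print(sample, ok)

# fullmatch() also requires the string to end right after .html/.htm.
with_query = "http://example.com/page.html?x=1"
print(bool(url_re.match(with_query)), bool(url_re.fullmatch(with_query)))
```

Note that `match()` only anchors the start, so a URL with a trailing query string still matches; use `fullmatch()` if the string must end in .html or .htm.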
Upvotes: 0
Reputation: 129832
The reason it's slow is that the non-greedy quantifier .*?, used this way, is prone to catastrophic backtracking.
Instead of saying "any amount of anything, but only to an extent where it doesn't conflict with later requirements", which is effectively what .*? is saying, try asking for "as much as possible that isn't a double quote, which would terminate the href":
<a href="(http://[^"]+\.[^"]+\.[^"]+/[^"]+\.html)">\1</a>
I also added a back-reference (\1) to your first capturing group, inside the <a>...</a>, so that you don't have to do the exact same matching all over again.
Note that this regex will break if, say, the <a> tag has a class name, an id, or any other attribute. I left it like this because I wanted to give you what you asked for with as few changes as possible, and as to-the-point as possible.
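Here is a small Python sketch of the back-reference approach, assuming Python's `re` module and with the dot before `html` escaped. The `with_class` and `different_text` strings are made-up inputs that illustrate the limitations mentioned above.

```python
import re

# Negated character classes ([^"]+) cannot run past the closing quote,
# and \1 requires the link text to repeat the captured URL exactly.
link_re = re.compile(
    r'<a href="(http://[^"]+\.[^"]+\.[^"]+/[^"]+\.html)">\1</a>')

url = "http://hosting.tyumen.ru/tip.html"
spam = '<a href="%s">%s</a>' % (url, url)

# Hypothetical inputs the pattern is NOT expected to match:
with_class = '<a class="x" href="%s">%s</a>' % (url, url)  # extra attribute
different_text = '<a href="%s">click here</a>' % url       # text != href

m = link_re.search(spam)
print(m.group(1))
print(link_re.search(with_class))
print(link_re.search(different_text))
```

If links with extra attributes need to be caught as well, the literal `<a href="` prefix would have to be loosened, or the HTML parsed properly instead.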
Upvotes: 5
Reputation: 702
Since you say you're a regexp newbie, I'll offer some more general advice on creating and debugging regular expressions. When they get complicated, I find Regexp Coach a must.
It's freeware and really saves a lot of headaches. Not to mention that you don't have to build and run your application every minute just to see whether the regexp works the way you wanted.
Upvotes: 2