Fuxi

Reputation: 7599

Regex Problem (newbie)

I'm writing a little app for spam checking and I'm having problems with a regex.

Let's say I have this spam URL:

http://hosting.tyumen.ru/tip.html 

So I want to check that the URL has two full stops (subdomain + ending), a slash, a word, a full stop and "html".

Here's what I've got so far:

<a href="(http://.*?\..*?..*?/.*?.html)">(http://.*?\..*?..*?/.*?.html)</a>

It might look like rubbish, but it works. The problem: it's really slow and freezes my app.

Any hints on how to optimize it? Thx.

Upvotes: 0

Views: 221

Answers (5)

YOU

Reputation: 123937

(http://[\w.-]+/.+?\.html) - maybe this will work, but for your case only.

Or maybe a faster one:

(http://[\w.-]+/[^.]+\.html)
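
As a rough illustration, here is a minimal Python sketch that runs both patterns against the anchor tag from the question. The sample strings and variable names are made up for the demonstration, and the asker's actual language is not stated, so treat this as an assumption-laden sketch rather than a drop-in fix.

```python
import re

# The two patterns from this answer, compiled unchanged.
lazy = re.compile(r'(http://[\w.-]+/.+?\.html)')
negated = re.compile(r'(http://[\w.-]+/[^.]+\.html)')

# Hypothetical input: the spam link from the question, wrapped in an anchor tag.
sample = ('<a href="http://hosting.tyumen.ru/tip.html">'
          'http://hosting.tyumen.ru/tip.html</a>')

# Both patterns find the URL twice: once in the href, once in the link text.
print(lazy.findall(sample))
print(negated.findall(sample))

# The second pattern rejects paths that contain an extra dot,
# because [^.]+ cannot step over a full stop.
dotted = 'http://hosting.tyumen.ru/tip.old.html'
print(bool(lazy.search(dotted)), bool(negated.search(dotted)))
```

The second pattern is stricter: since `[^.]+` stops at the first full stop, the engine has far fewer alternatives to try, at the cost of rejecting file names such as `tip.old.html`.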

Upvotes: 2

Mez

Reputation: 24951

#http://[-a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[-a-zA-Z]+/\w+\.html#

Upvotes: 0

ikkebr

Reputation: 801

In Python, a simple way to match URLs ending in .html or .htm is to use

import re

url_re = re.compile(
    r'https?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'\S+\.html?',  # path ending in .html or .htm (literal dot)
    re.IGNORECASE)

which is a modified version of Django's URLField regex.

This will match any URL ending in .html or .htm, whether the host is localhost, an IP address, or a domain.
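
A short usage sketch may help here. It assumes the dot before "html" is escaped and uses `fullmatch` so that the URL really has to end in .html or .htm; the candidate URLs are invented for the example.

```python
import re

# Same structure as the pattern in this answer; the escaped dot before
# "html" and the single trailing group are assumptions of this sketch.
url_re = re.compile(
    r'https?://'
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'
    r'localhost|'
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
    r'(?::\d+)?'
    r'\S+\.html?',
    re.IGNORECASE)

# Hypothetical candidates covering a domain, localhost, an IP and a non-match.
candidates = [
    'http://hosting.tyumen.ru/tip.html',
    'https://localhost:8000/a.htm',
    'http://127.0.0.1/x.html',
    'http://example.com/index.php',
]

# fullmatch requires the whole string to match, so the URL must end in .htm(l).
results = {url: bool(url_re.fullmatch(url)) for url in candidates}
for url, ok in results.items():
    print(url, ok)
```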

Upvotes: 0

David Hedlund

Reputation: 129832

The reason it's slow is that non-greedy quantifiers (.*?) used this way are prone to catastrophic backtracking: with several of them in a row, the engine has to try a huge number of ways to split the input between them before it can give up on a failed match.

Instead of saying "any amount of anything, but only to an extent where it doesn't conflict with later requirements", which is effectively what .*? is saying, try asking for "as much as possible that isn't a double quote, which would terminate the href":

<a href="(http://[^"]+\.[^"]+\.[^"]+/[^"]+\.html)">\1</a>

I also added a back-reference (\1) to your first capturing group, inside the <a>...</a>, so that you don't have to do the exact same matching all over again.

Note that this regex will break if, say, the <a> tag has a class name, an id, or any other attribute besides href. I left it like this because I wanted to give you what you asked for with as few changes as possible, and as to-the-point as possible.
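
To make the behaviour concrete, here is a small Python sketch of the pattern above, with the dot before "html" escaped. The three sample tags are hypothetical, and Python is only an assumption since the question does not name a language.

```python
import re

# Negated character classes plus a back-reference (\1) to the href value.
# The dot before "html" is escaped so that it only matches a literal ".".
link_re = re.compile(
    r'<a href="(http://[^"]+\.[^"]+\.[^"]+/[^"]+\.html)">\1</a>')

# Hypothetical inputs: link text equal to the href, link text that differs,
# and a tag carrying an extra attribute before href.
spam = ('<a href="http://hosting.tyumen.ru/tip.html">'
        'http://hosting.tyumen.ru/tip.html</a>')
mismatch = '<a href="http://hosting.tyumen.ru/tip.html">click here</a>'
with_class = ('<a class="x" href="http://hosting.tyumen.ru/tip.html">'
              'http://hosting.tyumen.ru/tip.html</a>')

m = link_re.search(spam)
print(m.group(1) if m else None)      # the captured URL
print(link_re.search(mismatch))       # None: \1 must repeat the href exactly
print(link_re.search(with_class))     # None: the extra attribute breaks it
```

Because `[^"]+` can never run past the closing quote of the href, a failed match is abandoned quickly instead of scanning the rest of the input.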

Upvotes: 5

bbb

Reputation: 702

Since you say you're a regexp newbie, I'll offer some more general advice on creating and debugging regular expressions. When they get complicated, I find Regexp Coach a must.

It's freeware and really saves a lot of headaches. Not to mention that you don't have to build and run your application every minute just to see whether the regexp works the way you wanted.

Upvotes: 2
