badc0re

Reputation: 3523

Improve robots regular expression

I have made the following regexp for extracting robots links:

re.compile(r"/\S+(?:\/+)")

And I get the following result:

/includes/
/modules/
/search/
/?q=user/password/
/?q=user/register/
/node/add/
/logout/
/?q=admin/
/themes/
/?q=node/add/
/admin/
/?q=comment/reply/
/misc/
//example.com/
//example.com/site/
/profiles/
//www.robotstxt.org/wc/
/?q=search/
/user/password/
/?q=logout/
/comment/reply/
/?q=filter/tips/
/?q=user/login/
/user/register/
/user/login/
/scripts/
/filter/tips/
//www.sxw.org.uk/computing/robots/

How can I exclude the links that start with two slashes, such as:

 //www.sxw.org.uk/computing/robots/
 //www.robotstxt.org/wc/
 //example.com/
 //example.com/site/

Any ideas?

Upvotes: 1

Views: 120

Answers (2)

buckley

Reputation: 14089

Assuming that each string to be matched starts at the beginning of a line, as in the sample, we can anchor the regex and use a negative lookahead:

^(?!//)/\S+(?:\/+)

Be sure to set the regex modifier that makes ^ match the beginning of each line (re.MULTILINE, or the inline (?m) flag, in Python).

My Python is rusty, but this should do it:

import re

# subject is the text to search
for match in re.finditer(r"(?m)^(?!//)/\S+(?:/+)", subject):
    # match.start() and match.end() give the span (end is exclusive)
    print(match.group())  # the matched text
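
A minimal, self-contained run of the pattern above, assuming the candidate links sit one per line as in the question's output. The sample `subject` string and the `links` name are made up for illustration:

```python
import re

# Hypothetical sample: one candidate link per line, as in the question's output.
subject = """/includes/
//example.com/
/?q=user/password/
//www.robotstxt.org/wc/
/admin/"""

# (?m) lets ^ match at every line start; (?!//) rejects lines that begin with "//".
pattern = re.compile(r"(?m)^(?!//)/\S+(?:/+)")

links = [m.group() for m in pattern.finditer(subject)]
print(links)
```

Note that the anchor only helps when the path begins the line. If the paths follow other text on the same line, ^ will not match there and the pattern would need adjusting.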

Upvotes: 1

Ashwini Chaudhary

Reputation: 250971

I'd suggest just adding an if condition:

 if not line.startswith('//'):
     pass  # then do something with the line here
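
As a sketch of how that check could be combined with the question's original pattern: keep the regex unchanged and filter its matches afterwards. The `text` sample is an assumption about what the input might look like; the question does not show the raw file.

```python
import re

# Hypothetical input; the "//example.com/site/" match comes from the URL
# in the comment line, which the original pattern also picks up.
text = (
    "Disallow: /admin/\n"
    "# see http://example.com/site/\n"
    "Disallow: /?q=search/"
)

# The pattern from the question, unchanged.
pattern = re.compile(r"/\S+(?:/+)")

# Keep only the matches that do not start with two slashes.
links = [m for m in pattern.findall(text) if not m.startswith("//")]
print(links)
```

This avoids touching the regex at all, at the cost of a second pass over the matches.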

Upvotes: 1
