user2988983

Reputation: 33

What am I doing wrong with this regular expression?

links = re.findall('href="(http(s?)://[^"]+)"',page)

I have this regular expression to find all links in a web page, but I am getting this result:

('http://asecuritysite.com', '')
('https://www.sans.org/webcasts/archive/2013', 's')

When what I want is only this:

http://asecuritysite.com
https://www.sans.org/webcasts/archive/2013

If I eliminate the "( after the href, it gives me loads of errors. Can someone explain why?

Upvotes: 0

Views: 78

Answers (4)

brandonscript

Reputation: 72855

You will also run into problems if the URL is preceded by a single quote instead of a double quote.

(https?:\/\/[^\"\'\>]+) will capture the entire URL. You could then prepend (href=.?) to it, which gives you two capture groups:

Full regex: (href=.?)(https?:\/\/[^\"\'\>]+)

MATCH 1

  • [Group 1] href='
  • [Group 2] http://asecuritysite.com

MATCH 2

  • [Group 1] href='
  • [Group 2] https://www.sans.org/webcasts/archive/2013

Here is a working example: http://regex101.com/r/gO8vV7
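
For Python's re.findall specifically, a variation on the pattern above may be more convenient: keep the quote outside the capture group so that only the URL is captured and findall returns plain strings. This is a minimal sketch, not the pattern from the regex101 link. The sample page string is made up for illustration, and the backslashes before /, ", ' and > are dropped because Python's re module does not need them.

```python
import re

# Hypothetical sample input mixing single- and double-quoted attributes.
page = """
<a href='http://asecuritysite.com'>here</a>
<a href="https://www.sans.org/webcasts/archive/2013">there</a>
"""

# ["'] accepts either quote character; only the URL itself is captured,
# so findall returns a list of strings rather than tuples.
links = re.findall(r"""href=["'](https?://[^"'>]+)""", page)
print(links)
```

Because the pattern has exactly one capturing group, each result is the URL alone, whichever quote style the page uses.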

Upvotes: 0

falsetru

Reputation: 368954

If you use more than one capturing group, re.findall returns a list of tuples instead of a list of strings. Try the following, which uses only a single group:

>>> import re
>>> page = '''
...     <a href="http://asecuritysite.com">here</a>
...     <a href="https://www.sans.org/webcasts/archive/2013">there</a>
...     '''
>>> re.findall(r'href="(https?:\/\/[^"]+)"',page)
['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']

According to the re.findall documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
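
To make the quoted rule concrete, here is a small sketch that runs the same search with zero, one, and two capturing groups, plus a non-capturing (?:...) variant. The sample page string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample input, modelled on the links in the question.
page = (
    '<a href="http://asecuritysite.com">here</a> '
    '<a href="https://www.sans.org/webcasts/archive/2013">there</a>'
)

# No groups: findall returns the whole match for each hit.
no_group = re.findall(r'href="https?://[^"]+"', page)

# One group: findall returns just that group, as a list of strings.
one_group = re.findall(r'href="(https?://[^"]+)"', page)

# Two groups (the pattern from the question): a list of tuples.
two_groups = re.findall(r'href="(http(s?)://[^"]+)"', page)

# A non-capturing group (?:...) groups without adding a tuple slot.
non_capturing = re.findall(r'href="(http(?:s?)://[^"]+)"', page)

print(no_group)
print(one_group)
print(two_groups)
print(non_capturing)
```

The two-group version reproduces the tuples from the question, while the one-group and non-capturing versions both return plain URL strings.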

Upvotes: 2

Jakob Bowyer

Reputation: 34688

What you are doing wrong is trying to parse HTML with regex. And that, sir, is a sin.

See here for the horrors of Regex parsing HTML

An alternative is to use something like lxml to parse the page and extract the links, something like this:

urls = html.xpath('//a/@href')
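
If installing lxml is not an option, the same idea can be sketched with the standard library's html.parser module instead. Note that this swaps in a different parser than the one suggested above. The LinkCollector class name and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.urls.append(value)


# Hypothetical sample input with mixed quoting and one relative link.
page = """
<a href='http://asecuritysite.com'>here</a>
<a href="https://www.sans.org/webcasts/archive/2013">there</a>
<a href="/about">about</a>
"""

collector = LinkCollector()
collector.feed(page)
urls = collector.urls

# Keep only absolute http(s) links, as the regex in the question does.
absolute = [u for u in urls if u.startswith(("http://", "https://"))]
print(urls)
print(absolute)
```

A parser handles either quote style and extra attributes without any changes to the pattern, which is the main argument against the regex approach.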

Upvotes: 1

p.s.w.g

Reputation: 149000

Try getting rid of the second group (the (s?) in your original pattern):

links = re.findall(r'href="(https?://[^"]+)"', page)
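
As for the errors mentioned in the question, one plausible cause is that removing only the opening parenthesis leaves a closing ) without a partner, which the re module rejects when it compiles the pattern. This is an assumption, since the exact edited pattern and traceback are not shown. The sketch below uses a guessed reconstruction of that edited pattern.

```python
import re

# Assumed reconstruction: the "(" after href=" was removed,
# but its matching ")" before the final quote was left in place.
broken = r'href="http(s?)://[^"]+)"'

try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    # The pattern fails to compile because of the unmatched ")".
    error_message = str(exc)

print(error_message)
```

Removing both parentheses of the outer group, or neither, avoids the error.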

Upvotes: 1
