Reputation: 33
links = re.findall('href="(http(s?)://[^"]+)"',page)
I am using the regular expression above to find all links in a web page, but I am getting this result:
('http://asecuritysite.com', '')
('https://www.sans.org/webcasts/archive/2013', 's')
When what I want is only this:
http://asecuritysite.com
https://www.sans.org/webcasts/archive/2013
If I remove the ( after href=" I get loads of errors. Can someone explain why?
Upvotes: 0
Views: 78
Reputation: 72855
You're also going to run into problems if the URL is wrapped in single quotes instead of double quotes.
(https?:\/\/[^\"\'\>]+)
will capture the entire URL. You could then prepend (href=.?) to it, and you'd end up with two capture groups:
Full regex: (href=.?)(https?:\/\/[^\"\'\>]+)
MATCH 1
href='
http://asecuritysite.com
MATCH 2
href='
https://www.sans.org/webcasts/archive/2013
Here is a working example: http://regex101.com/r/gO8vV7
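In Python, the same quote-tolerant idea can be sketched with a single capture group, so that re.findall returns plain strings rather than tuples. The sample page and the names pattern and links are made up for illustration, and the opening quote is matched with an optional character class rather than the .? used above.

```python
import re

# Hypothetical sample input mixing double-quoted, single-quoted and unquoted hrefs.
page = """
<a href="http://asecuritysite.com">here</a>
<a href='https://www.sans.org/webcasts/archive/2013'>there</a>
<a href=http://example.com/unquoted>bare</a>
"""

# Optional opening quote, then one capture group that stops at a quote or '>'.
pattern = r"""href=["']?(https?://[^"'>]+)"""

links = re.findall(pattern, page)
print(links)
```

Because only the URL is inside a capturing group, each element of links is a string. An unquoted href followed by further attributes would still be captured too greedily, so treat this as a rough filter rather than a parser.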
Upvotes: 0
Reputation: 368954
If the pattern has more than one capturing group, re.findall returns a list of tuples instead of a list of strings. Try the following, which uses only a single group:
>>> import re
>>> page = '''
... <a href="http://asecuritysite.com">here</a>
... <a href="https://www.sans.org/webcasts/archive/2013">there</a>
... '''
>>> re.findall(r'href="(https?:\/\/[^"]+)"',page)
['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']
According to the re.findall documentation:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
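To see the difference side by side, here is a small sketch that runs the question's two-group pattern and a one-group variant on the same input. The one-line page and the variable names are assumptions for illustration; the variant uses a non-capturing group, (?:...), which is one way to keep the original grouping without adding a second captured value.

```python
import re

# Hypothetical one-line sample containing the two links from the question.
page = (
    '<a href="http://asecuritysite.com">a</a> '
    '<a href="https://www.sans.org/webcasts/archive/2013">b</a>'
)

# Two capturing groups: each match becomes a tuple (outer group, inner group).
two_groups = re.findall(r'href="(http(s?)://[^"]+)"', page)

# Non-capturing (?:...) keeps the grouping but leaves only one captured group.
one_group = re.findall(r'href="(http(?:s?)://[^"]+)"', page)

print(two_groups)
print(one_group)
```

The first call reproduces the tuples from the question; the second returns only the URLs.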
Upvotes: 2
Reputation: 34688
What you are doing wrong is trying to parse HTML with regex. And that, sir, is a sin.
See here for the horrors of regex parsing HTML
An alternative is to use something like lxml to parse the page and extract the links, something like this (where html is the already-parsed document):
urls = html.xpath('//a/@href')
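For a version that runs without installing anything, the same parse-then-extract approach can be sketched with the standard library's html.parser instead of lxml. This is a swapped-in technique, not the lxml call above; the LinkCollector class and the sample page are hypothetical names introduced for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.urls.append(value)


# Hypothetical sample input with both quote styles.
page = """
<a href="http://asecuritysite.com">here</a>
<a href='https://www.sans.org/webcasts/archive/2013'>there</a>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.urls)
```

Unlike the regex, this collects every href on an anchor tag regardless of quoting, including relative links, so filter on the scheme afterwards if only http and https URLs are wanted.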
Upvotes: 1
Reputation: 149000
Try getting rid of the second group (the (s?) in your original pattern):
links = re.findall(r'href="(https?://[^"]+)"', page)
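As for the errors mentioned in the question, one likely cause (an assumption, since the exact edit was not shown) is that removing only the opening ( leaves its closing ) behind, and an unmatched parenthesis makes the pattern invalid. A minimal sketch of that situation, with broken and error_message as illustrative names:

```python
import re

# Assumed reading of the question: the "(" after href=" was removed,
# but its matching ")" before the closing quote was left in place.
broken = 'href="http(s?)://[^"]+)"'

try:
    re.compile(broken)
    error_message = None
except re.error as exc:
    # The pattern is rejected at compile time, before any matching happens.
    error_message = str(exc)

print(error_message)
```

Removing both parentheses of the outer group, or the inner group instead, gives a pattern that compiles.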
Upvotes: 1