Reputation: 32170
I am trying to get the address of a website's Facebook page by running a regular expression search on its HTML.
Usually the link appears as:
<a href="http://www.facebook.com/googlechrome">Facebook</a>
but sometimes the address will be http://www.facebook.com/some.other, and sometimes it contains numbers.
At the moment the regex that I have is
'(facebook.com)\S\w+'
but it won't catch the last two possibilities.
What is it called when I want the regex to match something but not include it in the result? (For instance, I want the regex to match the www.facebook.com part but not have that part in the result, only the part that comes after it.)
Note: I am using Python with re and urllib2.
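To show what I mean, here is a quick test of my current pattern against the example addresses above (paths containing a dot get cut short):

import re

html_a = '<a href="http://www.facebook.com/googlechrome">Facebook</a>'
html_b = '<a href="http://www.facebook.com/some.other">Facebook</a>'

# \w+ stops at the first character that is not a letter, digit or underscore,
# so anything after a dot in the path is lost
print(re.search(r'(facebook.com)\S\w+', html_a).group(0))  # facebook.com/googlechrome
print(re.search(r'(facebook.com)\S\w+', html_b).group(0))  # facebook.com/some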
Upvotes: 0
Views: 1188
Reputation: 80396
If I assume correctly, the URL is always in double quotes, right?
re.findall(r'"http://www\.facebook\.com(.+?)"', html)
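For example, run against the sample anchor from the question, this returns just the path:

import re

html = '<a href="http://www.facebook.com/googlechrome">Facebook</a>'
print(re.findall(r'"http://www\.facebook\.com(.+?)"', html))
# ['/googlechrome']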
Overall, trying to parse HTML with regex is a bad idea. I suggest you use an HTML parser like lxml.html to find the links, and then use urlparse:
>>> from urlparse import urlparse # in 3.x use from urllib.parse import urlparse
>>> url = 'http://www.facebook.com/some.other'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'www.facebook.com'
>>> parse_object.path
'/some.other'
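A minimal sketch of that approach, assuming lxml is installed (the target URL here is just a placeholder):

import urllib2
from urlparse import urlparse  # in 3.x use from urllib.parse import urlparse
import lxml.html

html = urllib2.urlopen('http://www.example.com').read()  # hypothetical page to scan
doc = lxml.html.fromstring(html)

for href in doc.xpath('//a/@href'):          # every link target on the page
    parsed = urlparse(href)
    if parsed.netloc.endswith('facebook.com'):
        print(parsed.path)                   # e.g. '/some.other'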
Upvotes: 0
Reputation: 43467
Seems to me your main issue is that you don't understand regex well enough.
fb_re = re.compile(r'www\.facebook\.com([^"]+)')
Then simply:
results = fb_re.findall(html)
Why this works: in regular expressions, the part in parentheses () is what gets captured. You were putting the www.facebook.com part inside the parentheses, so that was all you got back. Here I used a character set [], which matches any of the characters inside it; the ^ operator negates it, meaning anything not in the set, and I gave it the " character. So it matches everything that comes after www.facebook.com until it reaches a " and then stops. (The dots are escaped with \ so they match a literal period rather than any character.)
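For example, on the markup from the question (nothing beyond the sample anchor is assumed):

import re

html = '<a href="http://www.facebook.com/some.other">Facebook</a>'
fb_re = re.compile(r'www\.facebook\.com([^"]+)')
print(fb_re.findall(html))
# ['/some.other']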
Note: this catches Facebook links that are embedded in markup. If the Facebook link simply appears on the page as plain text, you can use:
fb_re = re.compile(r'www\.facebook\.com(\S+)')
which grabs any run of non-whitespace characters, so it stops as soon as it hits whitespace.
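For instance, with a made-up bit of plain page text:

import re

text = 'Follow us at www.facebook.com/some.other for updates'
fb_re = re.compile(r'www\.facebook\.com(\S+)')
print(fb_re.findall(text))
# ['/some.other']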
If you are worried about links ending in periods, you can use:
fb_re = re.compile(r'www\.facebook\.com(\S+?)\.?\s')
which searches for the same thing as above but stops at the end of a sentence: an optional . followed by whitespace such as a space or a newline. Because \S+? is non-greedy, it grabs as little as it can and leaves a trailing sentence period for the \.? part. This way it will still grab links like /some.other
but when the text has something like /some.other.
it will drop that final .
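A quick comparison of the two patterns on a made-up sentence that ends with the link:

import re

text = 'Our page is at www.facebook.com/some.other. Check it out.'
plain_re = re.compile(r'www\.facebook\.com(\S+)')
strict_re = re.compile(r'www\.facebook\.com(\S+?)\.?\s')
print(plain_re.findall(text))   # ['/some.other.'] -- keeps the sentence period
print(strict_re.findall(text))  # ['/some.other']  -- drops it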
Upvotes: 1