Nick Ginanto

Reputation: 32170

Extracting a Facebook page address from HTML using regex

I am trying to get the address of a website's Facebook page by running a regular expression search on the site's HTML.

usually the link appears as <a href="http://www.facebook.com/googlechrome">Facebook</a>

but sometimes the address will be http://www.facebook.com/some.other

and sometimes with numbers

at the moment the regex that I have is

'(facebook.com)\S\w+'

but it won't catch the last 2 possibilities

What is it called when I want the regex to search for something but not return it? For instance, I want the regex to match the www.facebook.com part but not include that part in the result, only the part that comes after it.

note I use python with re and urllib2

Upvotes: 0

Views: 1188

Answers (2)

root

Reputation: 80396

If I assume correctly, the URL is always in double quotes, right?

re.findall(r'"http://www\.facebook\.com(.+?)"',url)

Overall, trying to parse HTML with regex is a bad idea. I suggest you use an HTML parser like lxml.html to find the links and then use urlparse to split each one:

>>> from urlparse import urlparse # in 3.x use from urllib.parse import urlparse
>>> url = 'http://www.facebook.com/some.other'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'www.facebook.com'
>>> parse_object.path
'/some.other'
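
As a rough sketch of the parser-plus-urlparse approach, here is one way it could look on Python 3. To keep it dependency-free it swaps lxml.html for the standard library's html.parser, and the names LinkCollector, facebook_paths and sample are made up for illustration, so treat it as an assumption-laden example rather than the exact code suggested above.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def facebook_paths(html):
    """Return the path of every link whose host is facebook.com or a subdomain."""
    collector = LinkCollector()
    collector.feed(html)
    paths = []
    for href in collector.hrefs:
        parts = urlparse(href)
        host = parts.hostname or ""  # hostname is None for relative links
        if host == "facebook.com" or host.endswith(".facebook.com"):
            paths.append(parts.path)
    return paths


# Sample markup invented for this example.
sample = (
    '<a href="http://www.facebook.com/googlechrome">Facebook</a> '
    '<a href="http://www.facebook.com/some.other">Other</a> '
    '<a href="http://example.com/about">About</a>'
)
print(facebook_paths(sample))
```

Because the host is compared after parsing, dots and digits in the page name need no special handling.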

Upvotes: 0

Inbar Rose

Reputation: 43467

It seems to me your main issue is that you don't understand enough regex yet.

fb_re = re.compile(r'www\.facebook\.com([^"]+)')

then simply:

results = fb_re.findall(url)

why this works:

In regular expressions, the part in parentheses () is what gets captured (this is called a capturing group). You were putting the www.facebook.com part in the parentheses, so that was the only thing being returned.

Here I used a character set [] and negated it with the ^ operator, which means "any character not in the set". The only character in the set is ", so the group matches everything that comes after www.facebook.com up to, but not including, the next ".
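
To make the "match it but don't return it" idea concrete, here is a small sketch on Python 3 comparing a capturing group with a lookbehind assertion, which is the other common name for this technique. The html string is made up for the example, and unlike the pattern above it places the / in the prefix so that only the page name comes back.

```python
import re

# Sample markup invented for this example.
html = '<a href="http://www.facebook.com/some.other123">Facebook</a>'

# Capturing group: the prefix has to match, but findall returns only group 1.
with_group = re.findall(r'www\.facebook\.com/([^"]+)', html)

# Lookbehind: the prefix is asserted but is never part of the match itself.
# Python's re module requires the lookbehind to have a fixed width, which
# this literal prefix does.
with_lookbehind = re.findall(r'(?<=www\.facebook\.com/)[^"]+', html)

print(with_group)
print(with_lookbehind)
```

Both calls return the same list here; the capturing group is usually the simpler choice, since lookbehind in re cannot contain variable-length parts such as an optional www.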

Note: this catches Facebook links that are embedded in a quoted attribute. If the Facebook link simply appears on the page as plain text, you can use:

fb_re = re.compile(r'www\.facebook\.com(\S+)')

which grabs every non-white-space character, so it stops as soon as it reaches white-space.

If you are worried about links that end in a period, you can use:

fb_re = re.compile(r'www\.facebook\.com(\S+?)\.?(?=\s|$)')

This searches for the same thing as above, but leaves out a final . when it sits at the end of a sentence, meaning it is followed by white-space (a space or newline) or by the end of the text. This way it still grabs links like /some.other in full, but for something like /some.other. it drops the last . from the result.
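
A quick way to check how a plain-text pattern treats trailing periods is to run it over a sentence or two. The sketch below assumes Python 3 and one possible variant of the pattern (a lazy \S+?, an optional period, and a lookahead for white-space or end of text); the text string is invented for the test.

```python
import re

# Lazy match, then an optional trailing period that must be followed by
# white-space or the end of the string. The period is left out of group 1.
fb_re = re.compile(r'www\.facebook\.com(\S+?)\.?(?=\s|$)')

# Sample text invented for this example.
text = (
    "Find us at www.facebook.com/some.other. "
    "Also see www.facebook.com/googlechrome and www.facebook.com/page123"
)
print(fb_re.findall(text))
```

The dot inside /some.other survives because it is followed by another non-white-space character, so only a period at the very end of the link is dropped.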

Upvotes: 1
