extracting facebook page from html using regex

Question

I am trying to get the address of a website's facebook page by running a regular expression search on its html

usually the link appears as Facebook

but sometimes the address will be http://www.facebook.com/some.other

and sometimes with numbers

at the moment the regex that I have is

'(facebook.com)\S\w+'

but it won't catch the last 2 possibilities

what is it called when I want the regex to search for something but not fetch it? (for instance I want the regex to match the www.facebook.com part but not have that part in the result, only the part that comes after it)

note: I use Python with re and urllib2

Inbar Rose · Accepted Answer

seems to me your main issue is that you don't understand enough regex yet.

fb_re = re.compile(r'www\.facebook\.com([^"]+)')

then simply:

results = fb_re.findall(html)  # html is the page source as a string

why this works:

in regular expressions the part in the parentheses () is a capturing group: it is what gets returned. you were putting the facebook.com part in the parentheses, so that was all you got back and nothing after it.

here i used a character set [] and negated it with the ^ operator, which means "any character not in the set", and then i gave it the " character. so it will match everything that comes after www.facebook.com until it reaches a " and then stop. the dots in the domain are escaped as \. so that they only match a literal dot.
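as a minimal sketch of the above: the html string is a made-up sample, not taken from a real page, and the lookbehind variant is an assumed alternative for the "search but not fetch" part of the question, not something the pattern above requires.

```python
import re

# Made-up sample HTML for illustration; a real page would be fetched first.
html = ('<a href="http://www.facebook.com/some.other">Facebook</a> '
        '<a href="https://www.facebook.com/pages/12345">Like us</a>')

# The parentheses capture only what follows the domain, up to the closing quote.
fb_re = re.compile(r'www\.facebook\.com([^"]+)')
results = fb_re.findall(html)

# Assumed alternative: a lookbehind requires the domain to precede the match
# without making it part of the result, so no capturing group is needed.
lookbehind_results = re.findall(r'(?<=www\.facebook\.com)[^"]+', html)

print(results)
print(lookbehind_results)
```

both lists should contain only the paths after the domain, including the one with a dot and the one with numbers.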

note - this catches facebook links which are embedded in an attribute such as href="...". if the facebook link is simply on the page in plaintext you can use:

fb_re = re.compile(r'www\.facebook\.com(\S+)')

which means to grab any run of non-white-space characters, so it will stop as soon as it reaches white-space.
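a small sketch of the difference, assuming two made-up input strings: on plain text \S+ stops at the first space, but on an embedded link it runs past the closing quote, which is why the [^"]+ version is the one to use for html attributes.

```python
import re

fb_plain_re = re.compile(r'www\.facebook\.com(\S+)')

# Made-up plain-text sample: the link is followed by a space.
plain_text = 'Find us at www.facebook.com/some.other or call us'
plain_results = fb_plain_re.findall(plain_text)

# Made-up embedded sample: there is no white-space right after the link,
# so \S+ keeps going through the quote and the tag text.
embedded = '<a href="http://www.facebook.com/page">Facebook</a> more text'
embedded_results = fb_plain_re.findall(embedded)

print(plain_results)
print(embedded_results)
```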

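if you are worried about links at the end of a sentence picking up the final period, you can use:

fb_re = re.compile(r'www\.facebook\.com(\S+)\.\s')

which tells it to search for the same as above, but to stop at the end of a sentence: a . followed by any white-space like a space or newline. this way it will still grab links like /some.other in full, but when you have something like /some.other. it will leave off the last period. note that this pattern only matches links that are followed by a period and white-space, so a link in the middle of a sentence will not be found by it.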
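the snippet below checks both cases on made-up sentences. the last part is an assumed workaround rather than part of the pattern above: capture with the plain \S+ version and strip trailing periods afterwards, so links in the middle of a sentence are kept too.

```python
import re

# Made-up sample sentences for illustration.
ends_sentence = 'Like us at www.facebook.com/some.other. Thanks!'
mid_sentence = 'Visit www.facebook.com/some.other today'

# Pattern from the answer: the link must be followed by a period and white-space.
strict_re = re.compile(r'www\.facebook\.com(\S+)\.\s')
strict_hit = strict_re.findall(ends_sentence)
strict_miss = strict_re.findall(mid_sentence)

# Assumed workaround: match any non-white-space run, then drop trailing periods.
loose_re = re.compile(r'www\.facebook\.com(\S+)')
combined = ends_sentence + ' ' + mid_sentence
cleaned = [path.rstrip('.') for path in loose_re.findall(combined)]

print(strict_hit)
print(strict_miss)
print(cleaned)
```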
