user2695817

Reputation: 121

Write a Python regex to match multiple URLs in an HTML source page using BeautifulSoup

I am scraping web pages with BeautifulSoup and trying to extract the links in an HTML page that match a given list of URLs.

Suppose I want to get the Facebook and Twitter links in a page. I tried:

urls_list = ['www.facebook.com','www.apps.facebook.com', 'www.twitter.com']
reg = re.compile(i for i in urls_list)
print soup('a',{'href':reg})

and

soup = BeautifulSoup(html_source)
reg = re.compile(r"(http|https)://(www.[apps.]facebook|twitter).com/\w+")
print soup('a',{'href':reg})

The code above does not work; it retrieves all the URLs in the page. Please bear with my limited knowledge of regex and Python.

Upvotes: 0

Views: 430

Answers (1)

Martijn Pieters

Reputation: 1124768

You need to produce a valid regular expression:

reg = re.compile(r"^https?://www\.(apps\.)?(facebook|twitter)\.com/[\w-]+")

Quick demo:

>>> reg = re.compile(r"^https?://www\.(apps\.)?(facebook|twitter)\.com/[\w-]+")
>>> reg.search('https://www.apps.facebook.com/hello_world')
<_sre.SRE_Match object at 0x105fe39b0>
>>> reg.search('http://www.facebook.com/hello_world')
<_sre.SRE_Match object at 0x105fe3918>
>>> reg.search('http://www.twitter.com/hello_world')
<_sre.SRE_Match object at 0x105fe39b0>
>>> reg.search('http://www.twitters.com/')
>>> reg.search('http://www.twitter.com/')
>>> reg.search('http://twitter.com/hello')
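
The first attempt in the question passes a generator to re.compile, which expects a single pattern string. If the hosts have to come from a list, one option is to escape each entry and join them with |. This is only a minimal sketch: build_host_regex is a hypothetical helper name, and the trailing /[\w-]+ path part is carried over from the pattern above as an assumption about which links should count.

```python
import re

# Hosts taken from the question.
urls_list = ['www.facebook.com', 'www.apps.facebook.com', 'www.twitter.com']

def build_host_regex(hosts):
    """Hypothetical helper: compile one pattern matching any listed host."""
    # re.escape turns each "." into a literal dot; "|" separates alternatives.
    alternatives = '|'.join(re.escape(host) for host in hosts)
    # (?:...) groups the alternatives without capturing them.
    return re.compile(r'^https?://(?:%s)/[\w-]+' % alternatives)

reg = build_host_regex(urls_list)

# Matches return a match object; non-matches return None.
print(reg.search('https://www.apps.facebook.com/hello_world') is not None)
print(reg.search('http://www.twitters.com/hello') is not None)
```

The compiled reg can then be passed as the href filter in the same way as in the question, soup('a', {'href': reg}).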

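The syntax [...] creates a character class, which matches any single character listed inside it; [apps.] is the same as [aps.], in that it matches either an a, a p, an s, or a literal . (dot). Outside of character classes, an unescaped . matches any character (except a newline, by default), so the literal dots in a hostname have to be written as \. and the optional apps. prefix as the group (apps\.)?.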
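
The difference between a character class, an optional group, and an unescaped dot can be checked directly. The pattern names below are illustrative only, and the ^ and $ anchors are added just to make each comparison exact.

```python
import re

# Character class: exactly one character out of a, p, s, or "."
char_class = re.compile(r'^[apps.]$')
# Optional group: the literal text "apps." or nothing at all
group = re.compile(r'^(apps\.)?$')

# Each listed character matches on its own; "x" is not in the class.
single_chars = [c for c in 'aps.x' if char_class.search(c)]
print(single_chars)

# The class never matches the whole word, the group does.
print(char_class.search('apps.') is None)
print(group.search('apps.') is not None)

# An unescaped dot accepts any character in that position.
unescaped_dot = re.compile(r'^www.facebook$')
escaped_dot = re.compile(r'^www\.facebook$')
print(unescaped_dot.search('wwwXfacebook') is not None)
print(escaped_dot.search('wwwXfacebook') is None)
```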

Upvotes: 1
