Reputation: 121
I am working on web scraping with BeautifulSoup and am trying to extract only the links on an HTML page that match a given list of URLs.
Suppose I want to get the Facebook and Twitter links on a page. I tried:
urls_list = ['www.facebook.com','www.apps.facebook.com', 'www.twitter.com']
reg = re.compile(i for i in urls_list)
print soup('a',{'href':reg})
and
soup = BeautifulSoup(html_source)
reg = re.compile(r"(http|https)://(www.[apps.]facebook|twitter).com/\w+")
print soup('a',{'href':reg})
The code above is not working; it retrieves all the URLs on the page. Please bear with my limited knowledge of regex and Python.
Upvotes: 0
Views: 430
Reputation: 1124768
You need to produce a valid regular expression:
reg = re.compile(r"^https?://www\.(apps\.)?(facebook|twitter)\.com/[\w-]+")
Quick demo:
>>> reg = re.compile(r"^https?://www\.(apps\.)?(facebook|twitter)\.com/[\w-]+")
>>> reg.search('https://www.apps.facebook.com/hello_world')
<_sre.SRE_Match object at 0x105fe39b0>
>>> reg.search('http://www.facebook.com/hello_world')
<_sre.SRE_Match object at 0x105fe3918>
>>> reg.search('http://www.twitter.com/hello_world')
<_sre.SRE_Match object at 0x105fe39b0>
>>> reg.search('http://www.twitters.com/')
>>> reg.search('http://www.twitter.com/')
>>> reg.search('http://twitter.com/hello')
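If you would rather keep the hostnames in a list, as in the first attempt in the question, one option is to build the alternation from that list. This is only a sketch: the hrefs list is made up for illustration, and the path part ([\w-]+) is copied from the pattern above. re.escape makes each dot literal, and "|".join produces the alternatives:

```python
import re

# Hostnames from the question (no scheme), treated as plain strings.
hosts = ['www.facebook.com', 'www.apps.facebook.com', 'www.twitter.com']

# re.escape turns each "." into a literal dot; "|" separates the alternatives.
alternatives = "|".join(re.escape(h) for h in hosts)
reg = re.compile(r"^https?://(?:" + alternatives + r")/[\w-]+")

# Hypothetical hrefs, made up for illustration only.
hrefs = [
    'https://www.apps.facebook.com/hello_world',
    'http://www.twitter.com/hello_world',
    'http://www.twitters.com/hello',
    'http://example.com/www.facebook.com/page',
]

# Keep only the hrefs whose host is one of the listed hostnames.
matching = [h for h in hrefs if reg.search(h)]
print(matching)
```

The compiled reg can then be passed as the href filter in the same way as the hand-written pattern.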
The syntax [...] creates a character class; any single character within that class matches. [apps.] is the same as [aps.] in that it'll match either an a, a p, an s or a literal . dot. Outside of character classes, . matches any character (except a newline, by default), so it must be escaped as \. to match a literal dot.
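A small check of both points, using only the re module. The test strings here are made up for illustration:

```python
import re

sample = "a.p!s"

# A character class matches one character at a time; the repeated "p" adds nothing.
with_repeat = re.findall(r"[apps.]", sample)
without_repeat = re.findall(r"[aps.]", sample)
print(with_repeat == without_repeat)  # True

# Outside a class, an unescaped dot matches any character,
# so a lookalike string slips through the loose pattern.
loose = re.compile(r"www.twitter.com")
strict = re.compile(r"www\.twitter\.com")

loose_hit = bool(loose.search("wwwXtwitterYcom"))
strict_hit = bool(strict.search("wwwXtwitterYcom"))
print(loose_hit)   # True
print(strict_hit)  # False
```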
Upvotes: 1