Reputation: 679
Hi, I cannot for the life of me figure out how to find links that begin with certain text. findAll('a') works fine, but it returns far too much. I just want to make a list of all links that begin with http://www.nhl.com/ice/boxscore.htm?id=
Can anyone help me?
Thank you very much
Upvotes: 12
Views: 20466
Reputation: 15152
You can find all links and then filter that list to keep only the ones you need. This is still a fast solution, even though the filtering happens afterwards.
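listOfAllLinks = soup.findAll('a')
listOfLinksINeed = []
for link in listOfAllLinks:
    # A Tag is not a string, so test its href attribute rather than the tag itself
    href = link.get('href', '')
    if href.startswith('http://www.nhl.com/ice/boxscore.htm?id='):
        listOfLinksINeed.append(href)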
Upvotes: 1
Reputation: 554
You might not need BeautifulSoup since your search is specific
>>> import re
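>>> # Stop at a quote, whitespace or angle bracket; a greedy .+ would run to the end of the line
>>> links = re.findall(r'http://www\.nhl\.com/ice/boxscore\.htm\?id=[^"\'\s<>]+', str(doc))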
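If you would rather not run a regex over raw HTML but still want to skip BeautifulSoup, here is a minimal sketch using only the standard library's html.parser (Python 3). The names PrefixLinkCollector and find_links, and the sample doc string, are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser

PREFIX = "http://www.nhl.com/ice/boxscore.htm?id="


class PrefixLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value, e.g. <a href>
            if name == "href" and value and value.startswith(self.prefix):
                self.links.append(value)


def find_links(html, prefix=PREFIX):
    """Return all matching hrefs in document order."""
    parser = PrefixLinkCollector(prefix)
    parser.feed(html)
    parser.close()
    return parser.links


# Sample document, assumed for illustration only
doc = ('<html><body><div><a href="something">yep</a></div>'
       '<div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div>'
       '<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>')

print(find_links(doc))
```

Because this only looks at href attributes of a tags, a matching URL that merely appears in page text or in a script is ignored, which the plain regex cannot guarantee.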
Upvotes: 2
Reputation: 67063
First set up a test document and open up the parser with BeautifulSoup:
>>> from BeautifulSoup import BeautifulSoup
>>> doc = '<html><body><div><a href="something">yep</a></div><div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div><a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'
>>> soup = BeautifulSoup(doc)
>>> print soup.prettify()
<html>
 <body>
  <div>
   <a href="something">
    yep
   </a>
  </div>
  <div>
   <a href="http://www.nhl.com/ice/boxscore.htm?id=3">
    somelink
   </a>
  </div>
  <a href="http://www.nhl.com/ice/boxscore.htm?id=7">
   another
  </a>
 </body>
</html>
Next, we can search for all <a> tags with an href attribute starting with http://www.nhl.com/ice/boxscore.htm?id=. You can use a regular expression for it:
>>> import re
>>> soup.findAll('a', href=re.compile(r'^http://www\.nhl\.com/ice/boxscore\.htm\?id='))
[<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>, <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>]
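To see what the ^ anchor buys you, here is a small standalone check of such a pattern against a few href strings, using only the re module (Python 3). The hrefs list is invented for illustration, and the sketch assumes the pattern is applied with search, so without ^ it could match anywhere in the string.

```python
import re

# Raw string; "." and "?" are escaped so they match literally, "^" anchors at the start
pattern = re.compile(r'^http://www\.nhl\.com/ice/boxscore\.htm\?id=')

# Hypothetical href values, for illustration only
hrefs = [
    'something',
    'http://www.nhl.com/ice/boxscore.htm?id=3',
    'http://example.com/?next=http://www.nhl.com/ice/boxscore.htm?id=7',
]

# Keep only the hrefs that begin with the box score prefix
matching = [h for h in hrefs if pattern.search(h)]
print(matching)
```

The last href contains the box score URL but does not begin with it, so the anchored pattern rejects it.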
Upvotes: 16