Reputation: 4266
I've narrowed my HTML down, and I want to pull the href from each line only if the link text inside the a tag is 2010 or later. What's the best way to do this? I'll post my code first, and then the HTML.
Code:
links = [STEM_URL + row.a["href"] for row in divyclass.findAll("td") if row.a and int(row.a.contents[0]) >= 2010]
HTML:
<td align="center" class="tableheader" colspan="4" valign="middle">NBA Drafts</td>
<td align="center" class="text" valign="middle"> </td>
<td align="center" class="text" valign="middle"> </td>
<td align="center" class="text" valign="middle"> </td>
<td align="center" class="text" valign="middle"><a href="/nba_final_draft/2014">2014</a></td>
<td align="center" class="text" valign="middle"> <a href="/nba_final_draft/2013">2013</a></td>
<td align="center" class="text" valign="middle"> <a href="/nba_final_draft/2012">2012</a></td>
<td align="center" class="text" valign="middle"><a href="/nba_final_draft/2011">2011</a></td>
<td align="center" class="text" valign="middle"><a href="/nba_final_draft/2010">2010</a></td>
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_final_draft/2009">2009</a></td>
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_draft_history/2008.html">2008</a></td>
...
...
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_draft_history/1989.html">1980-1989</a></td>
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_draft_history/1979.html">1970-1979</a></td>
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_draft_history/1969.html">1960-1969</a></td>
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_draft_history/1959.html">1947-1959</a></td>
As you can see, the issue is that the contents within the a tags start becoming ranges, not integers when we hit 1989, thus messing up our last conditional clause in the list comprehension. What's the best way to go about this?
As of now, my code predictably returns an error ValueError: invalid literal for int() with base 10: '1980-1989'
Upvotes: 1
Views: 581
Reputation: 87124
Based on the data shown, you can probably assume that the second value in a range is greater than the first, and that a range always spans a decade whose first year is a multiple of 10. If those assumptions hold, it is enough to test the first value, and your code can be as simple as this:
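from urllib.parse import urljoin  # Python 2: from urlparse import urljoin
from bs4 import BeautifulSoup
STEM = 'http://www.nba.com'
html = '''your html here'''
# extra test case: a range that starts at the cutoff year
html += '<a href="/nba_draft_history/2019.html">2010-2019</a>'
soup = BeautifulSoup(html, 'html.parser')
# keep a link when the first (or only) year in its text is 2010 or later
urls = [urljoin(STEM, e.get('href')) for e in soup.find_all('a')
        if int(e.text.split('-')[0]) >= 2010]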
If some of those assumptions are invalid, or you want to cover more possibilities, you could do this:
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin
from bs4 import BeautifulSoup
STEM = 'http://www.nba.com'
html = '''your html here'''
# extra test cases: ordinary range, reversed range, range that straddles 2010
html += '<a href="/nba_draft_history/2019.html">2010-2019</a>'
html += '<a href="/nba_draft_history/2019.html">2019-2010</a>'
html += '<a href="/nba_draft_history/2015.html">2005-2015</a>'
soup = BeautifulSoup(html, 'html.parser')
# keep a link when the largest year in its text is 2010 or later
urls = [urljoin(STEM, e.get('href')) for e in soup.find_all('a')
        if max(int(y) for y in e.text.split('-')) >= 2010]
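The year test can also be checked on its own, without BeautifulSoup. What follows is only a minimal sketch: latest_year and recent_urls are hypothetical helper names, and the (href, label) pairs are hand-written sample data modelled on the table in the question, not scraped output.

```python
from urllib.parse import urljoin

STEM = 'http://www.nba.com'  # base URL as used in the answer above


def latest_year(label):
    """Return the largest year in a label such as '2014' or '1980-1989'."""
    return max(int(part) for part in label.split('-'))


def recent_urls(pairs, cutoff=2010):
    """Build absolute URLs for (href, label) pairs whose latest year >= cutoff."""
    return [urljoin(STEM, href) for href, label in pairs
            if latest_year(label) >= cutoff]


# Hypothetical sample data mirroring the question's table.
sample = [
    ('/nba_final_draft/2014', '2014'),
    ('/nba_final_draft/2009', '2009'),
    ('/nba_draft_history/1989.html', '1980-1989'),
    ('/nba_draft_history/2015.html', '2005-2015'),
]
print(recent_urls(sample))
```

Comparing integers with max() rather than sorting the strings means the test does not rely on every year having four digits.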
Upvotes: 1
Reputation: 731
One could do the following:
is_recent = lambda years: years[-1] >= 2010  # last value is the single year, or the end of a range
links = [STEM_URL + row.a["href"] for row in divyclass.findAll("td") if row.a and is_recent([int(y) for y in row.a.contents[0].split('-')])]
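If the table might also contain link text that is not a year at all, the conversion can be wrapped so that such cells are skipped instead of raising ValueError. This is a sketch under that assumption; label_is_recent is a hypothetical name, and returning False for unparseable text is a design choice, not something the question specifies.

```python
def label_is_recent(text, cutoff=2010):
    """Return True if any year in the label is >= cutoff.

    Labels that cannot be parsed as a year or a 'start-end' range
    (empty strings, headings, and so on) are treated as not recent.
    """
    try:
        years = [int(part) for part in text.strip().split('-')]
    except ValueError:
        return False
    return max(years) >= cutoff
```

It could then stand in for the condition in the comprehension, for example: if row.a and label_is_recent(row.a.contents[0]).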
Upvotes: 1