SpicyClubSauce

Reputation: 4266

HTML list comprehension issue while using Beautiful Soup with Python

I've narrowed my HTML down, and I want to pull the href from each line only if the text inside the a tag is a year of 2010 or later. What's the best way to do this? I'll post my code first, and then the HTML.

Code:

links = [STEM_URL + row.a["href"] for row in divyclass.findAll("td") if row.a and int(row.a.contents[0]) >= 2010]

HTML:

<td align="center" class="tableheader" colspan="4" valign="middle">NBA Drafts</td>
<td align="center" class="text" valign="middle"> </td>
<td align="center" class="text" valign="middle"> </td>
<td align="center" class="text" valign="middle"> </td>
<td align="center" class="text" valign="middle"><a href="/nba_final_draft/2014">2014</a></td>
<td align="center" class="text" valign="middle"> <a href="/nba_final_draft/2013">2013</a></td>
<td align="center" class="text" valign="middle"> <a href="/nba_final_draft/2012">2012</a></td>
<td align="center" class="text" valign="middle"><a href="/nba_final_draft/2011">2011</a></td>
<td align="center" class="text" valign="middle"><a href="/nba_final_draft/2010">2010</a></td>
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_final_draft/2009">2009</a></td>
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_draft_history/2008.html">2008</a></td>
...
...
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_draft_history/1989.html">1980-1989</a></td>
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_draft_history/1979.html">1970-1979</a></td>
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_draft_history/1969.html">1960-1969</a></td>
<td align="center" class="text" valign="middle" width="25%"><a href="/nba_draft_history/1959.html">1947-1959</a></td>

As you can see, the issue is that the contents of the a tags become ranges rather than integers once we reach 1989, which breaks the last conditional clause in the list comprehension. What's the best way to go about this?

As of now, my code predictably raises ValueError: invalid literal for int() with base 10: '1980-1989'
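
To make the failure concrete, here is a minimal sketch that runs the same int() conversion over the link texts shown above and collects the ones it rejects. The name find_unparseable is a hypothetical helper used only for illustration, and the rows elided with "..." are left out.

```python
# Link texts copied from the HTML above (the elided "..." rows are omitted).
labels = ["2014", "2013", "2012", "2011", "2010", "2009", "2008",
          "1980-1989", "1970-1979", "1960-1969", "1947-1959"]


def find_unparseable(texts):
    """Return the texts that int() rejects (hypothetical helper)."""
    bad = []
    for text in texts:
        try:
            int(text)
        except ValueError:
            # Same error the list comprehension hits on a range label.
            bad.append(text)
    return bad


bad_labels = find_unparseable(labels)
print(bad_labels)
```

Only the range labels fail, so any fix has to decide which year of a range should be compared against 2010.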

Upvotes: 1

Views: 581

Answers (2)

mhawke

Reputation: 87124

Based on the data shown, you can probably assume that the second value in a range is greater than the first, and that a range always spans a decade whose first year is a multiple of 10. If those assumptions hold, your code can be as simple as this:

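from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin
from bs4 import BeautifulSoup

STEM = 'http://www.nba.com'
html = '''your html here'''
# Extra test link whose text is a range starting at 2010.
html += '<a href="/nba_draft_history/2019.html">2010-2019</a>'
soup = BeautifulSoup(html, 'html.parser')
# Keep a link when the first year in its text is 2010 or later.
urls = [urljoin(STEM, e.get('href')) for e in soup.find_all('a')
        if int(e.text.split('-')[0]) >= 2010]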

If some of those assumptions are invalid, or you want to cover more possibilities, you could do this:

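from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin
from bs4 import BeautifulSoup

STEM = 'http://www.nba.com'
html = '''your html here'''
# Extra test links: a normal range, a reversed range, and a range that straddles 2010.
html += '<a href="/nba_draft_history/2019.html">2010-2019</a>'
html += '<a href="/nba_draft_history/2019.html">2019-2010</a>'
html += '<a href="/nba_draft_history/2015.html">2005-2015</a>'
soup = BeautifulSoup(html, 'html.parser')

# Keep a link when the largest year in its text is 2010 or later,
# regardless of the order in which the years appear.
urls = [urljoin(STEM, e.get('href')) for e in soup.find_all('a')
        if max(int(year) for year in e.text.split('-')) >= 2010]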
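
If some link texts might not be numeric at all, the comparison can be wrapped in a small helper that returns None instead of raising. The following is a minimal sketch under stated assumptions, not part of the code above: newest_year is a hypothetical helper, the pairs list stands in for the (href, text) values Beautiful Soup would return, and the "/about" row is invented to exercise the fallback.

```python
from urllib.parse import urljoin

STEM = "http://www.nba.com"  # same placeholder base URL as above


def newest_year(label):
    """Return the largest year in a label such as '2014' or '1980-1989'.

    Returns None when any part of the label is not a plain integer.
    (Hypothetical helper, not part of Beautiful Soup.)
    """
    try:
        return max(int(part) for part in label.split("-"))
    except ValueError:
        return None


# (href, link text) pairs taken from the question's HTML, plus one
# made-up non-numeric row.
pairs = [
    ("/nba_final_draft/2014", "2014"),
    ("/nba_final_draft/2010", "2010"),
    ("/nba_final_draft/2009", "2009"),
    ("/nba_draft_history/1989.html", "1980-1989"),
    ("/about", "About"),  # hypothetical row, not in the original HTML
]

# Treat unparseable labels as year 0 so they are filtered out.
urls = [urljoin(STEM, href) for href, label in pairs
        if (newest_year(label) or 0) >= 2010]
print(urls)
```

Treating a failed parse as 0 silently drops such rows; logging them instead may be preferable if unexpected labels should be noticed.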

Upvotes: 1

Jiwan

Reputation: 731

One could do the following:

# True when the last (or only) year in the label is 2010 or later.
is_recent = lambda years: years[-1] >= 2010
links = [STEM_URL + row.a["href"] for row in divyclass.findAll("td") if row.a and is_recent(list(map(int, row.a.contents[0].split('-'))))]
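
A Python 3 note on indexing the result of map, as the predicate here does: map returns a lazy iterator rather than a list, so it has to be materialised before it can be subscripted. A minimal sketch, using one of the range labels from the question:

```python
parts = "1980-1989".split("-")

years = map(int, parts)  # lazy iterator in Python 3
try:
    years[0]
    indexable = True
except TypeError:
    # A map object does not support indexing.
    indexable = False

years_list = list(map(int, parts))  # materialise before indexing
print(indexable, years_list[0], years_list[-1])
```

On Python 2, map returned a list directly, so the difference only shows up after moving to Python 3.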

Upvotes: 1
