HTML list comprehension issue while using Beautiful Soup w Python

Question

I've narrowed my HTML down, and I want to pull the href from each a tag only if the tag's text is a year of 2010 or later. What's the best way to do this? I'll post my code first, and then the HTML.

Code:

links = [STEM_URL + row.a["href"] for row in divyclass.findAll("td") if row.a and int(row.a.contents[0]) >= 2010]

HTML:

As you can see, the issue is that the contents of the a tags become ranges rather than single years once we reach 1989, which breaks the int() call in the comprehension's condition. What's the best way to go about this?

As of now, my code predictably raises ValueError: invalid literal for int() with base 10: '1980-1989'

mhawke · Accepted Answer

Based on the data shown, you can probably assume that the second value in a range is greater than the first, and that a range always spans a decade whose first year is a multiple of 10. If those assumptions hold, your code can be as simple as this:

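from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin
from bs4 import BeautifulSoup

STEM = 'http://www.nba.com'
html = '''your html here'''
html += '2010-2019'  # extra test data appended to the sample HTML
soup = BeautifulSoup(html, 'html.parser')
# Compare only the first year, so '2012' and '2010-2019' are handled alike
urls = [urljoin(STEM, e.get('href')) for e in soup.find_all('a')
            if int(e.text.split('-')[0]) >= 2010]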
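
To see why the split works for both single years and ranges, here is a minimal sketch that needs no HTML at all. The helper name first_year and the sample strings are made up for illustration; they are not taken from the original page.

```python
def first_year(text):
    """Return the first year of '2012' or '1980-1989' as an int."""
    # 'YYYY'.split('-') gives ['YYYY']; 'YYYY-ZZZZ'.split('-') gives ['YYYY', 'ZZZZ'].
    # Either way, element 0 is a plain year that int() can parse.
    return int(text.split('-')[0])


# Hypothetical link texts, similar in shape to those described in the question
samples = ['2012', '2010', '1990-1999', '1980-1989']

recent = [s for s in samples if first_year(s) >= 2010]
print(recent)
```

This only holds while every link text starts with a year; any other text would still raise ValueError.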

If some of those assumptions are invalid, or you want to cover more possibilities, you could do this:

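from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin
from bs4 import BeautifulSoup

STEM = 'http://www.nba.com'
html = '''your html here'''
# Extra test data: a normal range, a reversed range, and a range straddling 2010
html += '2010-2019'
html += '2019-2010'
html += '2005-2015'
soup = BeautifulSoup(html, 'html.parser')

# Sort the parts in descending order and test the largest one.
# This is a string sort, so it relies on every year having four digits.
urls = [urljoin(STEM, e.get('href')) for e in soup.find_all('a')
            if int(sorted(e.text.split('-'), reverse=True)[0]) >= 2010]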
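
If the page may also contain links whose text is not a year at all, a small helper keeps the condition from raising. The following is only a sketch under stated assumptions: the names latest_year and recent_urls, the cutoff parameter, and the sample (href, text) pairs are hypothetical, and it uses a regular expression in place of the split shown above.

```python
import re
from urllib.parse import urljoin

STEM = 'http://www.nba.com'
YEAR = re.compile(r'\d{4}')  # assumes years are written with four digits


def latest_year(text):
    """Return the largest four-digit year found in text, or None if there is none."""
    years = [int(y) for y in YEAR.findall(text)]
    return max(years) if years else None


def recent_urls(links, cutoff=2010):
    """Build absolute URLs for (href, text) pairs whose latest year is >= cutoff."""
    urls = []
    for href, text in links:
        year = latest_year(text)
        # Skip anchors without an href and anchors whose text holds no year
        if href and year is not None and year >= cutoff:
            urls.append(urljoin(STEM, href))
    return urls


# Hypothetical (href, text) pairs standing in for the anchors on the page
links = [
    ('/history/2012', '2012'),
    ('/history/2005-2015', '2005-2015'),
    ('/history/1980s', '1980-1989'),
    ('/about', 'About'),
    (None, '2014'),
]
print(recent_urls(links))
```

With Beautiful Soup, the pairs could be produced with ((a.get('href'), a.get_text()) for a in soup.find_all('a')). Comparing integers with max() also avoids relying on string sort order.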
