user2333196

Reputation: 5776

Pulling links and scraping those pages in Python

I would like to scrape some links from this page.

http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html

This gets the links that I want.

import re
import urllib2
from bs4 import BeautifulSoup

boxurl = urllib2.urlopen(url).read()  # url is the team page linked above
soup = BeautifulSoup(boxurl)
boxscores = soup.findAll('a', href=re.compile('boxscore'))

I would like to scrape every boxscore from the page. I have already written the code that scrapes data from a boxscore page, but I don't know how to open each of these links.

edit

I guess this way would be better, since it strips out the HTML tags. I still need to know how to open them.

for link in soup.find_all('a', href=re.compile('boxscore')):
    print(link.get('href'))
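
One wrinkle not shown above: the hrefs on this site appear to be relative paths (the answer below prepends the site root), so they have to be turned into absolute URLs before they can be opened. A minimal sketch using urlparse.urljoin, assuming the links are relative to the site root:

from urlparse import urljoin

base = 'http://www.covers.com'
boxscore_urls = [urljoin(base, link.get('href'))
                 for link in soup.find_all('a', href=re.compile('boxscore'))]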

edit2: This is how I scrape some of the data from the first link on the page.

url = 'http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/results/2012/boxscore841602.html'

boxurl = urllib2.urlopen(url).read()
soup = BeautifulSoup(boxurl)

def _unpack(row, kind='td'):
    # return the text of every <kind> cell in a table row
    return [val.text for val in row.findAll(kind)]

tables = soup('table')
linescore = tables[1]
linescore_rows = linescore.findAll('tr')
roadteamQ1 = float(_unpack(linescore_rows[1])[1])
roadteamQ2 = float(_unpack(linescore_rows[1])[2])
roadteamQ3 = float(_unpack(linescore_rows[1])[3])
roadteamQ4 = float(_unpack(linescore_rows[1])[4])

print roadteamQ1, roadteamQ2, roadteamQ3, roadteamQ4

However, when I try this:

url = 'http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html'
boxurl = urllib2.urlopen(url).read()
soup = BeautifulSoup(boxurl)

tables = pages[0]('table')
linescore = tables[1]
linescore_rows = linescore.findAll('tr')
roadteamQ1 = float(_unpack(linescore_rows[1])[1])
roadteamQ2 = float(_unpack(linescore_rows[1])[2])
roadteamQ3 = float(_unpack(linescore_rows[1])[3])
roadteamQ4 = float(_unpack(linescore_rows[1])[4])

I get this error:

    tables = pages[0]('table')
    TypeError: 'str' object is not callable

print pages[0]

spits out all of the HTML of the first link as expected. Hopefully that's not too confusing. To summarize: I can get the links now, but I still can't scrape data from them.

Upvotes: 0

Views: 293

Answers (1)

Vorsprung

Reputation: 34297

Something like this pulls the page behind every found link into a list, so the first page is pages[0], the second is pages[1], and so on:

boxscores = soup.findAll('a', href=re.compile('boxscore'))
basepath = "http://www.covers.com"
pages = []
for a in boxscores:
    # the hrefs are relative, so prepend the site root before fetching
    pages.append(urllib2.urlopen(basepath + a['href']).read())
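
Note that each entry in pages is still a raw HTML string, which is exactly why pages[0]('table') raises TypeError: 'str' object is not callable: the string has to be parsed before it can be queried. A minimal sketch of that last step, reusing the _unpack helper from the question and assuming every boxscore page shares the layout of the single-page example:

for page in pages:
    soup = BeautifulSoup(page)  # parse the raw HTML string first
    linescore_rows = soup('table')[1].findAll('tr')
    roadteamQ1 = float(_unpack(linescore_rows[1])[1])
    print roadteamQ1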

Upvotes: 1
