user1851527

Reputation: 27

Looping a scraper

I am trying to scrape data for all of the games a team played in a regular season, starting from http://www.basketball-reference.com/boxscores/201112250DAL.html. All my other data-gathering functions work fine; the problem is looping the scraper. The test code below gets the URL of the next game's page. I could chain it by hand to reach all 66 regular-season games, but that is a lot of typing. What would be the simplest way to automate this?

Thank you!

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

URL = "http://www.basketball-reference.com/boxscores/201112250DAL.html"

html = urlopen(URL).read()
soup = BeautifulSoup(html)

def getLink(html, soup):
    # The page's 'bold_text' links include the "Next Game" link; which one
    # it is depends on how many such links the page happens to have.
    links = soup.findAll('a', attrs={'class': 'bold_text'})
    if len(links) == 2:
        a = links[0]
    elif len(links) == 3:
        a = links[1]
    elif len(links) == 4:
        a = links[3]
    else:
        return None
    # Slice the next game's id (e.g. "/201112260DAL.") out of the tag's
    # string form; fixed offsets like this are fragile.
    return str(a)[37:51]

print getLink(html, soup)
URL1 = "http://www.basketball-reference.com/boxscores" + getLink(html, soup) + "html"
print URL1
html1 = urlopen(URL1).read()
soup1 = BeautifulSoup(html1)

print getLink(html1, soup1)
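As an aside on the slicing: reading the href attribute out of the anchor is less brittle than cutting str(a) at fixed offsets. A minimal stdlib-only sketch (the next_game_path helper and the sample anchor are illustrative, not taken from the real page):

```python
import re

def next_game_path(anchor_html):
    """Pull the href value out of an anchor tag's string form.

    Replaces the fixed a[37:51] slice, which silently breaks as soon as
    the attribute order or class name in the markup changes.
    """
    match = re.search(r'href="([^"]+)"', anchor_html)
    return match.group(1) if match else None

# Illustrative "Next Game" anchor, shaped like the ones on a box-score page.
anchor = '<a class="bold_text" href="/boxscores/201112260DAL.html">Next Game</a>'
print(next_game_path(anchor))  # -> /boxscores/201112260DAL.html
```

With a helper like this, the 66 games could be walked in a simple while loop: fetch a page, find the next-game anchor, build the next URL from its href, and stop when no link is found.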

Upvotes: 0

Views: 1718

Answers (2)

That1Guy

Reputation: 7233

The easiest way would be to go to http://www.basketball-reference.com/teams/DAL/2012_games.html and do something like this:

import urllib
from BeautifulSoup import BeautifulSoup

URL = 'http://www.basketball-reference.com/teams/DAL/2012_games.html'
html = urllib.urlopen(URL).read()
soup = BeautifulSoup(html)

links = soup.findAll('a',text='Box Score')

With text='Box Score', findAll returns the matching text nodes rather than the <a> tags themselves; each node's parent is the <a> tag, which is why the loop below reads link.parent['href']. Test it with this:

for link in links:
    print link.parent['href']
    page_url = 'http://www.basketball-reference.com' + link.parent['href']

From here, make another request to page_url and continue coding.
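The request loop that last sentence describes can be sketched like this (the hrefs list is hard-coded here for illustration; real code would collect it from the schedule page first):

```python
# Hypothetical hrefs, shaped like the values link.parent['href'] yields.
hrefs = ['/boxscores/201112250DAL.html', '/boxscores/201112260DAL.html']

base = 'http://www.basketball-reference.com'
page_urls = [base + href for href in hrefs]

for page_url in page_urls:
    # html = urllib.urlopen(page_url).read()  # fetch each box score here
    # soup = BeautifulSoup(html)              # ...then run the parsing code
    print(page_url)
```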

This is the entire code I used, and it worked perfectly for me:

from BeautifulSoup import BeautifulSoup
import urllib


url = 'http://www.basketball-reference.com/teams/DAL/2012_games.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)

links = soup.findAll('a',text='Box Score')
for link in links:
    print link.parent['href']

Upvotes: 3

dm03514

Reputation: 55972

The easiest way of all would be to use Scrapy, which follows links for you automatically.

It lets you easily define rules about which URLs to follow and which to ignore, and it will then crawl every URL that matches. It does require learning how Scrapy works, but the project provides an excellent quick tutorial on getting started.

Upvotes: 0
