Reputation: 63
I'm trying to harvest some chess games and got the basics done courtesy of some help here. The main function looks something like this:
import requests
from bs4 import BeautifulSoup

def get_game_ids(userurl):  # wrapped in a function for illustration; the name is arbitrary
    r = requests.get(userurl)
    soup = BeautifulSoup(r.content)
    gameids = []
    for link in soup.select('a[href^=/livechess/game?id=]'):
        gameid = link['href'].split("?id=")[1]
        gameids.append(int(gameid))
    return gameids
Basically, I go to the URL for a specific user, such as http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru, grab the HTML and scrape the game ids. This works fine for one page. However, some users have played lots of games, and since only 50 games are displayed per page, their games are listed on multiple pages, e.g. http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru&page=2 (or 3/4/5, etc.). That's where I'm stuck. How can I loop through the pages and get the ids?
Upvotes: 4
Views: 5619
Reputation: 474221
Follow the pagination: make an endless loop and keep following the "Next" link until it is no longer present.
Working code:
from urlparse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.chess.com/'
game_ids = []

next_page = 'http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru'
while True:
    soup = BeautifulSoup(requests.get(next_page).content)

    # collect the game ids
    for link in soup.select('a[href^=/livechess/game?id=]'):
        gameid = link['href'].split("?id=")[1]
        game_ids.append(int(gameid))

    # follow the "Next" link; exit the loop if it is not found
    try:
        next_page = urljoin(base_url, soup.select('ul.pagination li.next-on a')[0].get('href'))
    except IndexError:
        break

print game_ids
For the URL you've provided (Hikaru, GM), it would print a list of 224 game ids from all pages.
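Note that the code above targets Python 2. On Python 3 (and with recent BeautifulSoup/soupsieve releases) a few details change: urljoin moves to urllib.parse, print becomes a function, and the attribute value in the CSS selector has to be quoted. Here is a minimal sketch of the same loop under those assumptions; the URL and selectors are copied from the code above and assume the page structure hasn't changed:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.chess.com/'
game_ids = []

next_page = 'http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru'
while True:
    soup = BeautifulSoup(requests.get(next_page).content, 'html.parser')

    # collect the game ids on the current page (note the quoted selector value)
    for link in soup.select('a[href^="/livechess/game?id="]'):
        game_ids.append(int(link['href'].split("?id=")[1]))

    # follow the "Next" pagination link; stop once it is no longer present
    try:
        next_page = urljoin(base_url, soup.select('ul.pagination li.next-on a')[0].get('href'))
    except IndexError:
        break

print(game_ids)

The overall design is unchanged: the loop only terminates when selecting the "Next" link raises IndexError, i.e. on the last page of the archive.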
Upvotes: 5