walela

Reputation: 63

BeautifulSoup looping through urls

I'm trying to harvest some chess games and got the basics done courtesy of some help here. The main function looks something like:

import requests
from bs4 import BeautifulSoup

def get_game_ids(userurl):
    # fetch one archive page and parse it
    r = requests.get(userurl)
    soup = BeautifulSoup(r.content, 'html.parser')
    gameids = []
    # every game link looks like /livechess/game?id=<number>
    for link in soup.select('a[href^="/livechess/game?id="]'):
        gameid = link['href'].split("?id=")[1]
        gameids.append(int(gameid))
    return gameids

Basically, I go to the URL for a specific user, such as http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru, grab the HTML and scrape the game ids. This works fine for one page. However, some users have played lots of games, and since only 50 games are displayed per page, their games are listed on multiple pages, e.g. http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru&page=2 (or 3, 4, 5, etc.). That's where I'm stuck. How can I loop through the pages and get the ids?

Upvotes: 4

Views: 5619

Answers (1)

alecxe

Reputation: 474221

Follow the pagination with an endless loop that keeps requesting the "Next" link until that link is no longer found.

In other words, start on the first page and keep following the "Next" link in the pagination bar until you reach a page where that link can no longer be found.

Working code:

from urllib.parse import urljoin  # Python 3; on Python 2 use: from urlparse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.chess.com/'
game_ids = []

next_page = 'http://www.chess.com/home/game_archive?sortby=&show=live&member=Hikaru'
while True:
    soup = BeautifulSoup(requests.get(next_page).content, 'html.parser')

    # collect the game ids
    for link in soup.select('a[href^="/livechess/game?id="]'):
        gameid = link['href'].split("?id=")[1]
        game_ids.append(int(gameid))

    try:
        # resolve the relative "Next" href against the site root
        next_page = urljoin(base_url, soup.select('ul.pagination li.next-on a')[0].get('href'))
    except IndexError:
        break  # exiting the loop if "Next" link not found

print(game_ids)

For the URL you've provided (GM Hikaru), it prints a list of 224 game ids collected from all pages.
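
If you would rather loop over the page numbers from your question (&page=2, &page=3, ...) than follow the "Next" link, something along these lines might work. It is only a rough sketch: the helper names (archive_page_url, game_id_from_href, collect_game_ids) and the ids_on_page callback are made up for illustration, and it assumes that a page past the last one either shows no game links or repeats games already seen, which I have not verified against the site.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ARCHIVE_URL = 'http://www.chess.com/home/game_archive'


def archive_page_url(member, page=1):
    """Build the archive URL for one page of a member's live games."""
    params = [('sortby', ''), ('show', 'live'), ('member', member)]
    if page > 1:
        params.append(('page', page))
    return ARCHIVE_URL + '?' + urlencode(params)


def game_id_from_href(href):
    """Pull the integer id out of an href like '/livechess/game?id=123'."""
    return int(parse_qs(urlparse(href).query)['id'][0])


def collect_game_ids(member, ids_on_page, max_pages=100):
    """Visit page 1, 2, 3, ... until a page contributes no new ids.

    ids_on_page is a hypothetical callable (url -> list of ids); in practice
    it would wrap the requests + BeautifulSoup scraping for a single page.
    max_pages is an arbitrary safety cap, not a site limit.
    """
    game_ids = []
    seen = set()
    for page in range(1, max_pages + 1):
        ids = ids_on_page(archive_page_url(member, page))
        new_ids = [g for g in ids if g not in seen]
        if not new_ids:
            break  # empty or repeated page: assume we ran past the end
        seen.update(new_ids)
        game_ids.extend(new_ids)
    return game_ids
```

You would pass your own single-page scraping function as ids_on_page, e.g. collect_game_ids('Hikaru', get_ids_for_url), where get_ids_for_url is whatever function returns the ids for one URL. Following the "Next" link is still the safer option, because it does not depend on how the site behaves for out-of-range page numbers.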

Upvotes: 5
