NateRattner

Reputation: 67

Scraping multiple pages with Python Beautifulsoup -- only returning data from last page

I am trying to loop through multiple pages to scrape data with Python and BeautifulSoup. My script works for one page, but when I try to iterate through multiple pages, it only returns the data from the last page scraped. I think there may be something wrong with the way I am looping or storing/appending to the player_data list.

Here is what I have thus far -- any help is much appreciated.

#! python3
# downloadRecruits.py - Downloads espn college basketball recruiting database info

import requests, os, bs4, csv
import pandas as pd

# Starting url (class of 2007)
base_url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults/_/class/2007/page/'

# Number of pages to scrape (Not inclusive, so number + 1)
pages = map(str, range(1,3))

# url for starting page
url = base_url + pages[0]

for n in pages:
    # Create url
    url = base_url + n

    # Parse data using BS
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    # Creating bs object
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    table = soup.find('table')

    # Get the data
    data_rows = soup.findAll('tr')[1:]

    player_data = []
    for tr in data_rows:
        tdata = []
        for td in tr:
            tdata.append(td.getText())

            if td.div and td.div['class'][0] == 'school-logo':
                tdata.append(td.div.a['href'])

        player_data.append(tdata)

print(player_data)

Upvotes: 0

Views: 2499

Answers (3)

PRMoureu

Reputation: 13347

This is an indentation issue or a declaration issue, depending on the results you expect.

  • If you need to print the result for each page:

You can solve this by adding 4 spaces before print(player_data).

If you leave the print statement outside the for loop block, it is executed only once, after the loop has ended, so the only values it can display are the ones left in player_data by the last iteration of the loop.

  • If you want to store all results in player_data and print them at the end:

You must declare player_data outside and before your for loop (a fuller sketch follows the snippet below).

player_data = []
for n in pages:
    # [...]
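
A fuller sketch of this second option, reusing the question's URL and page range (the row parsing here is simplified and drops the school-logo link extraction):

import requests
import bs4

base_url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults/_/class/2007/page/'
pages = list(map(str, range(1, 3)))

player_data = []                          # declared once, before the loop

for n in pages:
    url = base_url + n
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    # Skip the header row, then collect every cell's text
    for tr in soup.find_all('tr')[1:]:
        tdata = [td.get_text() for td in tr.find_all('td')]
        player_data.append(tdata)         # rows from every page accumulate here

print(len(player_data))                   # total number of rows across all pages
print(player_data)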

Upvotes: 1

import requests
from bs4 import BeautifulSoup

# Starting url (class of 2007)
base_url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults/_/class/2007/page/'

# Number of pages to scrape (Not inclusive, so number + 1)
# In Python 3, map returns an iterable map object, not a subscriptable list,
# so you cannot index it with pages[0]. Wrapping the call in list() forces a list result.
pages = list(map(str, range(1, 3)))
# url for starting page
url = base_url + pages[0]

player_data = []  # declared once, before the loop, so rows from every page accumulate
for n in pages:
    # Create url
    url = base_url + n

    # Parse data using BS
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()

    # Creating bs object
    soup = BeautifulSoup(res.text, "html.parser")

    table = soup.find('table')

    # Get the data
    data_rows = soup.findAll('tr')[1:]

    for tr in data_rows:
        tdata = []
        for td in tr:
            tdata.append(td.getText())

            if td.div and td.div['class'][0] == 'school-logo':
                tdata.append(td.div.a['href'])

        player_data.append(tdata)

print(player_data)

Upvotes: 0

Kostas Drk

Reputation: 335

You should have your player_data list definition outside your loop, otherwise only the last iteration's results will be stored.
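
A small self-contained illustration of the difference, with dummy strings standing in for the scraped rows:

pages = ['1', '2', '3']

# Wrong: player_data is rebuilt on every iteration, so only the last page survives
for n in pages:
    player_data = []
    player_data.append('row from page ' + n)
print(player_data)   # ['row from page 3']

# Right: declared once before the loop, so every page's rows accumulate
player_data = []
for n in pages:
    player_data.append('row from page ' + n)
print(player_data)   # ['row from page 1', 'row from page 2', 'row from page 3']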

Upvotes: 1
