Reputation: 67
I am trying to loop through multiple pages to scrape data with Python and Beautifulsoup. My script works for one page, but when trying to iterate through multiple pages, it only returns the data from the last page scraped. I think there may be something wrong in the way I am looping or storing/appending the player_data
list.
Here is what I have thus far -- any help is much appreciated.
#! python3
# downloadRecruits.py - Downloads espn college basketball recruiting database info
import requests, os, bs4, csv
import pandas as pd
# Starting url (class of 2007)
base_url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults/_/class/2007/page/'
# Number of pages to scrape (Not inclusive, so number + 1)
pages = map(str, range(1,3))
# url for starting page
url = base_url + pages[0]
for n in pages:
# Create url
url = base_url + n
# Parse data using BS
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
# Creating bs object
soup = bs4.BeautifulSoup(res.text, "html.parser")
table = soup.find('table')
# Get the data
data_rows = soup.findAll('tr')[1:]
player_data = []
for tr in data_rows:
tdata = []
for td in tr:
tdata.append(td.getText())
if td.div and td.div['class'][0] == 'school-logo':
tdata.append(td.div.a['href'])
player_data.append(tdata)
print(player_data)
Upvotes: 0
Views: 2499
Reputation: 13347
This is an indentation issue or a declaration issue, depending on the results you expect.
You can solve this by adding 4 spaces before print(player_data).
If you let the print statement outside the for loop block, it will be executed only once, after the loop has ended. So the only values it can display are the last values of player_data
leaking from the last iteration of the for loop.
player_data
and print it at the end :you must declare player_data
outside and before your for loop.
player_data = []
for n in pages:
# [...]
Upvotes: 1
Reputation: 1
import requests
from bs4 import BeautifulSoup
# Starting url (class of 2007)
base_url = 'http://www.espn.com/college-sports/basketball/recruiting/databaseresults/_/class/2007/page/'
# Number of pages to scrape (Not inclusive, so number + 1)
pages = list(map(str,range(1,3)))
# In Python 3, map returns an iterable object of type map, and not a subscriptible list, which would allow you to write map[i]. To force a list result, write
# url for starting page
url = base_url + pages[0]
for n in pages:
# Create url
url = base_url + n
# Parse data using BS
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
# Creating bs object
soup = BeautifulSoup(res.text, "html.parser")
table = soup.find('table')
# Get the data
data_rows = soup.findAll('tr')[1:]
player_data = []
for tr in data_rows:
tdata = []
for td in tr:
tdata.append(td.getText())
if td.div and td.div['class'][0] == 'school-logo':
tdata.append(td.div.a['href'])
player_data.append(tdata)
print(player_data)
Upvotes: 0
Reputation: 335
You should have your player_data
list definition outside your loop, otherwise only the last iteration's results will be stored.
Upvotes: 1