Reputation: 17
I am scraping two columns from a table and looping the script over the paginated HTML (there are 19 pages of tables). However, the range I pass to what is supposed to be the page loop ends up determining the number of rows collected instead.
What am I doing wrong in my loop, such that the range sets the number of rows gathered INSTEAD of the set of HTML pages I want to scrape?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

empty_list = []
for i in range(1, 19):
    url = requests.get("https://www.foxsports.com/nhl/stats?season=2017&category=SCORING&group=1&sort=3&time=0&pos=0&team=0&qual=1&sortOrder=0&page={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    table = soup.find('table', {'class': 'wisbb_standardTable'})
    player = table.find('a', {'class': 'wisbb_fullPlayer'}).find('span').text
    team = table.find('span', {'class': 'wisbb_tableAbbrevLink'}).find('a').text
    empty_list.append((player, team))

df = pd.DataFrame(empty_list, columns=["player", "team"])
df
Upvotes: 0
Views: 50
Reputation: 2559
When you use find, it returns only the first matching element. You should use find_all instead. It returns a list of all matching elements, and you can then call find (or read .text) on each element in that list to get the data you need. As written, you are grabbing only the first (player, team) pair on each of the range(1, n) pages, which is why the number of rows matches the number of pages.
This code seems to give you what you are looking for:
import requests
from bs4 import BeautifulSoup
import pandas as pd

empty_list = []
# Note: range(1, 19) covers pages 1 through 18.
for i in range(1, 19):
    url = requests.get("https://www.foxsports.com/nhl/stats?season=2017&category=SCORING&group=1&sort=3&time=0&pos=0&team=0&qual=1&sortOrder=0&page={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    table = soup.find('table', {'class': 'wisbb_standardTable'})
    # find_all returns every matching element, not just the first one
    player = table.find_all('a', {'class': 'wisbb_fullPlayer'})
    team = table.find_all('span', {'class': 'wisbb_tableAbbrevLink'})
    # Pair each player with the team from the same row
    player_team_data = [{"player": p.text.split('\n')[1], "team": t.text.strip('\n')} for p, t in zip(player, team)]
    empty_list.extend(player_team_data)

df = pd.DataFrame(empty_list, columns=["player", "team"])
df.shape
(900, 2)
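To see the difference between find and find_all in isolation, here is a minimal sketch that runs without hitting the site. The HTML snippet is made up: it only loosely mimics the class names from the question, and the player and team names are placeholders. It also uses Python's built-in html.parser rather than lxml, purely so the example has fewer dependencies.

```python
from bs4 import BeautifulSoup

# Made-up HTML that loosely mimics the table structure in the question;
# the player and team names are placeholders, not real scraped data.
html = """
<table class="wisbb_standardTable">
  <tr>
    <td><a class="wisbb_fullPlayer"><span>Player A</span></a></td>
    <td><span class="wisbb_tableAbbrevLink"><a>AAA</a></span></td>
  </tr>
  <tr>
    <td><a class="wisbb_fullPlayer"><span>Player B</span></a></td>
    <td><span class="wisbb_tableAbbrevLink"><a>BBB</a></span></td>
  </tr>
</table>
"""

# html.parser ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wisbb_standardTable"})

# find returns only the first match (or None if nothing matches).
first_player = table.find("a", {"class": "wisbb_fullPlayer"}).find("span").text

# find_all returns every match, so each row can be processed in turn.
players = [a.find("span").text
           for a in table.find_all("a", {"class": "wisbb_fullPlayer"})]
teams = [s.find("a").text
         for s in table.find_all("span", {"class": "wisbb_tableAbbrevLink"})]
rows = list(zip(players, teams))

print(first_player)  # Player A
print(rows)          # [('Player A', 'AAA'), ('Player B', 'BBB')]
```

With find, one page contributes one row no matter how many players it lists; with find_all, one page contributes one row per player, so the page range controls only which pages are fetched.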
Upvotes: 1