Joseph K

Reputation: 17

Python scrape over Webpage Loop

I am scraping two columns out of a table and looping the script over 19 pages of HTML tables. However, the range I enter for what is supposed to be the webpage loop instead ends up controlling the number of rows gathered.

What am I doing wrong with my loop so that it sets the range for the rows of data gathered instead of the range of HTML pages I want to scrape over?

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
empty_list = []
for i in range (1,19):
    url = requests.get("https://www.foxsports.com/nhl/stats?season=2017&category=SCORING&group=1&sort=3&time=0&pos=0&team=0&qual=1&sortOrder=0&page={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    table = soup.find('table', {'class' : 'wisbb_standardTable'})
    player = table.find('a', {'class':'wisbb_fullPlayer'}).find('span').text
    team = table.find('span',{'class':'wisbb_tableAbbrevLink'}).find('a').text
    empty_list.append((player, team))
df = pd.DataFrame(empty_list, columns=["player", "team"])
df

sample table data

Upvotes: 0

Views: 50

Answers (1)

briancaffey

Reputation: 2559

When you use find, it returns only the first matching element, so your loop grabs just one player/team pair from each of the pages in range(1, n). Use find_all instead: it returns a list of every matching element, and you can then pull the text you need out of each item in that list.
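The difference can be seen on a small inline table (the HTML and player names below are made up for illustration; the question's page uses different class names):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td><a class="player">Connor McDavid</a></td></tr>
  <tr><td><a class="player">Sidney Crosby</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns only the first matching element
first = soup.find("a", {"class": "player"})
print(first.text)  # Connor McDavid

# find_all returns a list of every matching element
all_players = [a.text for a in soup.find_all("a", {"class": "player"})]
print(all_players)  # ['Connor McDavid', 'Sidney Crosby']
```

This is exactly the pattern in your code: find on the table gives one row's data per page, while find_all gives every row.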

This code seems to give you what you are looking for:

import requests
from bs4 import BeautifulSoup
import pandas as pd
empty_list = []
for i in range(1, 19):
    url = requests.get("https://www.foxsports.com/nhl/stats?season=2017&category=SCORING&group=1&sort=3&time=0&pos=0&team=0&qual=1&sortOrder=0&page={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    table = soup.find('table', {'class' : 'wisbb_standardTable'})
    player = table.find_all('a', {'class':'wisbb_fullPlayer'})
    team = table.find_all('span',{'class':'wisbb_tableAbbrevLink'})
    player_team_data = [{"player":p.text.split('\n')[1], "team":t.text.strip('\n')} for p,t in zip(player,team)]
    for p in player_team_data:
        empty_list.append(p)
df = pd.DataFrame(empty_list, columns=["player", "team"])

df.shape

(900, 2)
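As a side note, the import csv in the original snippet is never used; pandas can write the collected frame to CSV directly. A minimal sketch, with hypothetical sample rows standing in for the scraped data:

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped (player, team) tuples
rows = [("Connor McDavid", "EDM"), ("Sidney Crosby", "PIT")]
df = pd.DataFrame(rows, columns=["player", "team"])

# With no path argument, to_csv returns the CSV text;
# pass a filename (e.g. df.to_csv("players.csv", index=False)) to write a file
csv_text = df.to_csv(index=False)
print(csv_text)
```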

Upvotes: 1
