Reputation: 33
I am scraping NFL passing data for years 1971 to 2019. I was able to scrape the data on the first page of each year using this code:
# This code works (imports shown for completeness):
import requests
from bs4 import BeautifulSoup as bsoup

passingData = []  # create empty list to store column data
for year in range(1971, 2020):
    url = 'https://www.nfl.com/stats/player-stats/category/passing/%s/REG/all/passingyards/desc' % (year)
    response = requests.get(url)
    response = response.content
    parsed_html = bsoup(response, 'html.parser')
    data_rows = parsed_html.find_all('tr')
    passingData.append([[col.text.strip() for col in row.find_all('td')] for row in data_rows])
The first page for each year only lists 25 players, and roughly 70-90 players threw a pass each year, so there are 3-4 pages of player data ("subpages") within each year. The problem comes when I try to scrape these subpages. I tried to add a nested for-loop that pulls the href of the link to the next page out of the div class 'nfl-o-table-pagination__buttons' and appends it to the base URL.
Unfortunately, I cannot add the subpage data to the passingData list from the first page. I attempted the code below, but an 'Index Out of Range' error occurred on the subUrl line.
I am still new to web scraping, so if my logic is way off please let me know. I figured I could just append the subpage data (since the table structure is the same), but the error arises when I attempt to go from:
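For reference, here is a minimal offline sketch of that link extraction. The HTML snippet is an assumption pieced together from the class names mentioned in this question (it is not copied from nfl.com), and the href value is a made-up placeholder:

```python
from bs4 import BeautifulSoup

# Assumed pagination markup, based only on the class names in the question
sample_html = """
<div class="nfl-o-table-pagination__buttons">
  <a class="nfl-o-table-pagination__next"
     href="/stats/player-stats/category/passing/2019/REG/all/passingYards/DESC?aftercursor=ABC123">Next</a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
# select_one returns None when there is no next-page link, which avoids
# the IndexError that indexing [0] on an empty select() result raises
next_link = soup.select_one('.nfl-o-table-pagination__buttons a')
next_url = None
if next_link:
    # the href is site-relative, so join it to the domain, not the full page URL
    next_url = 'https://www.nfl.com' + next_link['href']
```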
https://www.nfl.com/stats/player-stats/category/passing/%s/REG/all/passingyards/desc
to the second page, which has url of :
https://www.nfl.com/stats/player-stats/category/passing/2019/REG/all/passingYards/DESC?aftercursor=0000001900000000008500100079000840a7a000000000006e00000005000000045f74626c00000010706572736f6e5f7465616d5f737461740000000565736249640000000944415234363631343100000004726f6c6500000003504c5900000008736561736f6e496400000004323031390000000a736561736f6e5479706500000003524547f07fffffe6f07fffffe6389bd3f93412939a78c1e6950d620d060004
for subPage in range(1971, 2020):
    subPassingData = []
    subUrl = soup.select('.nfl-o-table-pagination__buttons a')[0]['href']
    new = requests.get(f"{url}{subUrl}")
    newResponse = new.content
    soup1 = bsoup(new.text, 'html.parser')
    sub_data_rows = soup1.find_all('tr')
    subPassingData.append([[col.text.strip() for col in row.find_all('td')] for row in data_rows])
    passingData.append(subPassingData)
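One thing worth noting about the f"{url}{subUrl}" line above: since the href is site-relative, concatenating it onto the full page URL produces a malformed address. A small stdlib sketch of the difference, using a shortened placeholder href:

```python
from urllib.parse import urljoin

base = 'https://www.nfl.com/stats/player-stats/category/passing/2019/REG/all/passingyards/desc'
# shortened stand-in for the real aftercursor href
href = '/stats/player-stats/category/passing/2019/REG/all/passingYards/DESC?aftercursor=XYZ'

broken = base + href            # path repeated mid-URL
fixed = urljoin(base, href)     # href resolved against the domain root
```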
Thank you for your help.
Upvotes: 3
Views: 824
Reputation: 195408
This script goes over all selected years and sub-pages and loads the data into a dataframe (or you can save it to a CSV instead, etc.):
import requests
from bs4 import BeautifulSoup

url = 'https://www.nfl.com/stats/player-stats/category/passing/{year}/REG/all/passingyards/desc'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

all_data = []
for year in range(2017, 2020):  # <-- change to desired years
    soup = BeautifulSoup(requests.get(url.format(year=year), headers=headers).content, 'html.parser')

    page = 1
    while True:
        print('Page {}/{}...'.format(page, year))

        # collect every table row that has data cells, prefixing the year
        for tr in soup.select('tr:has(td)'):
            tds = [year] + [td.get_text(strip=True) for td in tr.select('td')]
            all_data.append(tds)

        # follow the "next page" link until there isn't one
        next_url = soup.select_one('.nfl-o-table-pagination__next')
        if not next_url:
            break

        u = 'https://www.nfl.com' + next_url['href']
        soup = BeautifulSoup(requests.get(u, headers=headers).content, 'html.parser')
        page += 1

# here we create a dataframe from the list `all_data` and print it to screen:
import pandas as pd
df = pd.DataFrame(all_data)
print(df)
Prints:
Page 1/2017...
Page 2/2017...
Page 3/2017...
Page 4/2017...
Page 1/2018...
Page 2/2018...
Page 3/2018...
Page 4/2018...
Page 1/2019...
Page 2/2019...
Page 3/2019...
Page 4/2019...
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 2017 Tom Brady 4577 7.9 581 385 0.663 32 8 102.8 230 0.396 62 10 64 35 201
1 2017 Philip Rivers 4515 7.9 575 360 0.626 28 10 96 216 0.376 61 12 75 18 120
2 2017 Matthew Stafford 4446 7.9 565 371 0.657 29 10 99.3 209 0.37 61 16 71 47 287
3 2017 Drew Brees 4334 8.1 536 386 0.72 23 8 103.9 201 0.375 72 11 54 20 145
4 2017 Ben Roethlisberger 4251 7.6 561 360 0.642 28 14 93.4 207 0.369 52 14 97 21 139
.. ... ... ... ... ... ... ... .. .. ... ... ... .. .. .. .. ...
256 2019 Trevor Siemian 3 0.5 6 3 0.5 0 0 56.3 0 0 0 0 3 2 17
257 2019 Blake Bortles 3 1.5 2 1 0.5 0 0 56.3 0 0 0 0 3 0 0
258 2019 Kenjon Barner 3 3 1 1 1 0 0 79.2 0 0 0 0 3 0 0
259 2019 Alex Tanney 1 1 1 1 1 0 0 79.2 0 0 0 0 1 0 0
260 2019 Matt Haack 1 1 1 1 1 1 0 118.8 1 1 0 0 1 0 0
[261 rows x 17 columns]
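As mentioned above, the rows can be written to CSV instead of printed. A stdlib-only sketch, assuming all_data holds rows shaped like the output above (year first, then the table cells); the column labels here are hypothetical stand-ins, since the real header text comes from the page's th cells:

```python
import csv

# stand-in rows shaped like the script's output (truncated to 4 columns)
all_data = [
    [2017, 'Tom Brady', '4577', '7.9'],
    [2017, 'Philip Rivers', '4515', '7.9'],
]

with open('nfl_passing.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # hypothetical labels -- replace with the table's real <th> text
    writer.writerow(['year', 'player', 'pass_yds', 'yds_per_att'])
    writer.writerows(all_data)
```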
Upvotes: 2