Scraping table by beautiful soup 4

Hello, I am trying to scrape the table at this URL: https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc

The table shows 50 rows, but if you click "Show more" (just below the table), more rows appear. My Beautiful Soup code works, but it retrieves only the first 50 rows. It does not retrieve the rows that appear after clicking "Show more". How can I get all the rows, both the first 50 and those that appear after clicking "Show more"? Here is the code:

import io
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Request the ESPN rushing stats page
rqst = requests.get("https://www.espn.com/nfl/stats/player/_/stat/rushing/season/2018/seasontype/2/table/rushing/sort/rushingYards/dir/desc")
soup = BeautifulSoup(rqst.content, 'lxml')
table = soup.find_all('table')
# Parse every <table> found in the HTML into a list of DataFrames
NFL_player_stats = pd.read_html(io.StringIO(str(table)))
players = NFL_player_stats[0]
players.shape

out[0]:  (50,1)

Upvotes: 1

Answers (1)

furas

Reputation: 143187

Using DevTools in Firefox, I can see that the page gets the data for the next page (in JSON format) from

https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page=2

If you change the value of page= then you can get the other pages.

import requests

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='

for page in range(1, 4):
    print('\n---', page, '---\n')

    r = requests.get(url + str(page))
    data = r.json()

    #print(data.keys())

    for item in data['athletes']:
        print(item['athlete']['displayName'])

Result:

--- 1 ---

Ezekiel Elliott
Saquon Barkley
Todd Gurley II
Joe Mixon
Chris Carson
Christian McCaffrey
Derrick Henry
Adrian Peterson
Phillip Lindsay
Nick Chubb
Lamar Miller
James Conner
David Johnson
Jordan Howard
Sony Michel
Marlon Mack
Melvin Gordon
Alvin Kamara
Peyton Barber
Kareem Hunt
Matt Breida
Tevin Coleman
Aaron Jones
Doug Martin
Frank Gore
Gus Edwards
Lamar Jackson
Isaiah Crowell
Mark Ingram II
Kerryon Johnson
Josh Allen
Dalvin Cook
Latavius Murray
Carlos Hyde
Austin Ekeler
Deshaun Watson
Kenyan Drake
Royce Freeman
Dion Lewis
LeSean McCoy
Mike Davis
Josh Adams
Alfred Blue
Cam Newton
Jamaal Williams
Tarik Cohen
Leonard Fournette
Alfred Morris
James White
Mitchell Trubisky

--- 2 ---

Rashaad Penny
LeGarrette Blount
T.J. Yeldon
Alex Collins
C.J. Anderson
Chris Ivory
Marshawn Lynch
Russell Wilson
Blake Bortles
Wendell Smallwood
Marcus Mariota
Bilal Powell
Jordan Wilkins
Kenneth Dixon
Ito Smith
Nyheim Hines
Dak Prescott
Jameis Winston
Elijah McGuire
Patrick Mahomes
Aaron Rodgers
Jeff Wilson Jr.
Zach Zenner
Raheem Mostert
Corey Clement
Jalen Richard
Damien Williams
Jaylen Samuels
Marcus Murphy
Spencer Ware
Cordarrelle Patterson
Malcolm Brown
Giovani Bernard
Chase Edmonds
Justin Jackson
Duke Johnson
Taysom Hill
Kalen Ballage
Ty Montgomery
Rex Burkhead
Jay Ajayi
Devontae Booker
Chris Thompson
Wayne Gallman
DJ Moore
Theo Riddick
Alex Smith
Robert Woods
Brian Hill
Dwayne Washington

--- 3 ---

Ryan Fitzpatrick
Tyreek Hill
Andrew Luck
Ryan Tannehill
Josh Rosen
Sam Darnold
Baker Mayfield
Jeff Driskel
Rod Smith
Matt Ryan
Tyrod Taylor
Kirk Cousins
Cody Kessler
Darren Sproles
Josh Johnson
DeAndre Washington
Trenton Cannon
Javorius Allen
Jared Goff
Julian Edelman
Jacquizz Rodgers
Kapri Bibbs
Andy Dalton
Ben Roethlisberger
Dede Westbrook
Case Keenum
Carson Wentz
Brandon Bolden
Curtis Samuel
Stevan Ridley
Keith Ford
Keenan Allen
John Kelly
Kenjon Barner
Matthew Stafford
Tyler Lockett
C.J. Beathard
Cameron Artis-Payne
Devonta Freeman
Brandin Cooks
Isaiah McKenzie
Colt McCoy
Stefon Diggs
Taylor Gabriel
Jarvis Landry
Tavon Austin
Corey Davis
Emmanuel Sanders
Sammy Watkins
Nathan Peterman
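
The long query string used above can also be assembled from a dictionary, which makes it easier to see which part is the page number. This is only a sketch: build_url is a hypothetical helper name, and whether other values for season or limit are accepted by the API is an assumption that has not been checked here.

```python
from urllib.parse import urlencode

BASE_URL = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete'


def build_url(page, season=2018, limit=50):
    """Return the statistics URL for one page of rushing results.

    build_url is a hypothetical helper; the parameter names and values
    are copied from the URL seen in DevTools.
    """
    params = {
        'region': 'us',
        'lang': 'en',
        'contentorigin': 'espn',
        'isqualified': 'false',
        'limit': limit,
        'category': 'offense:rushing',        # urlencode turns ':' into %3A
        'sort': 'rushing.rushingYards:desc',
        'season': season,
        'seasontype': 2,
        'page': page,
    }
    return BASE_URL + '?' + urlencode(params)


print(build_url(2))
```

With requests, the same dictionary could instead be passed as requests.get(BASE_URL, params=params), which does the encoding for you.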

EDIT: get all the data as a DataFrame

import requests
import pandas as pd

url = 'https://site.web.api.espn.com/apis/common/v3/sports/football/nfl/statistics/byathlete?region=us&lang=en&contentorigin=espn&isqualified=false&limit=50&category=offense%3Arushing&sort=rushing.rushingYards%3Adesc&season=2018&seasontype=2&page='

rows = [] # collect one list per player, build the DataFrame once at the end

for page in range(1, 4):
    print('page:', page)

    r = requests.get(url + str(page))
    data = r.json()

    #print(data.keys())

    for item in data['athletes']:
        player_name = item['athlete']['displayName']
        position = item['athlete']['position']['abbreviation']
        gp = item['categories'][0]['totals'][0]
        other_values = item['categories'][2]['totals']
        row = [player_name, position, gp] + other_values

        rows.append(row) # DataFrame.append() was removed in pandas 2.0

df = pd.DataFrame(rows, columns=['NAME', 'POS', 'GP', 'ATT', 'YDS', 'AVG', 'LNG', 'BIG', 'TD', 'YDS/G', 'FUM', 'LST', 'FD'])

print(len(df)) # 150
print(df.head(20))
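
The loop above hard-codes range(1, 4). If you don't know how many pages there are, one option is to keep requesting pages until one comes back without athletes. The following is a rough sketch, not a tested client: it assumes the API returns an empty or missing 'athletes' list past the last page (not verified here), iter_athletes and fake_fetch are hypothetical names, and the "Player A/B/C" entries are placeholder data so the loop can be run offline.

```python
def iter_athletes(fetch_page, max_pages=100):
    """Yield athlete entries page by page until a page comes back empty.

    fetch_page(page) must return the decoded JSON dict for that page.
    max_pages is a safety cap in case the stop condition never triggers.
    """
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        athletes = data.get('athletes') or []
        if not athletes:  # assumed end-of-data signal, not verified
            break
        yield from athletes


def fake_fetch(page):
    """Stand-in for the real request, returning placeholder data."""
    pages = {
        1: {'athletes': [{'athlete': {'displayName': 'Player A'}},
                         {'athlete': {'displayName': 'Player B'}}]},
        2: {'athletes': [{'athlete': {'displayName': 'Player C'}}]},
    }
    return pages.get(page, {'athletes': []})


names = [item['athlete']['displayName'] for item in iter_athletes(fake_fetch)]
print(names)
```

Against the real API you would pass something like lambda page: requests.get(url + str(page)).json() as fetch_page, with url being the string defined above.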

Upvotes: 2
