user3768804

Reputation: 139

Need help scraping an NHL statistics table with lxml and xpath

I am new to Python (using Python 3.6) and am learning it mainly to be able to build a scraper for this page: http://www.nhl.com/stats/player?aggregate=0&gameType=2&report=skatersummary&pos=S&reportType=season&seasonFrom=20162017&seasonTo=20162017&filter=gamesPlayed,gte,1&sort=points,goals,assists

I have tried many things. I originally wanted to use XPath, but after failing I decided to try BeautifulSoup4, and I am getting this error:

    for row in soup('table', {'class': 'stat-table'})[0].tbody('tr'):
IndexError: list index out of range

from this code

import urllib.request
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib.request.urlopen('http://www.nhl.com/stats/player?aggregate=0&gameType=2&report=skatersummary&pos=S&reportType=season&seasonFrom=20162017&seasonTo=20162017&filter=gamesPlayed,gte,1&sort=points,goals,assists'),"lxml")

for row in soup('table', {'class': 'stat-table'})[0].tbody('tr'):
    tds = row('td')
    print(tds[0].string, tds[1].string)

Upvotes: 1

Views: 1335

Answers (1)

nguaman

Reputation: 971

To make this work, you have to find the URL that the page itself requests from its internal API.
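
For context, the IndexError in the question means that soup('table', {'class': 'stat-table'}) returned an empty list, so the downloaded HTML contains no such table; presumably the page fills it in with JavaScript after loading. Here is a minimal, stdlib-only sketch of how you could check for that. The two HTML strings are made-up stand-ins, not the real page, and TableFinder and count_tables are names I chose for illustration:

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """Counts <table> tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.matches += 1


def count_tables(html, class_name="stat-table"):
    """Return how many tables with the given class appear in the markup."""
    finder = TableFinder(class_name)
    finder.feed(html)
    return finder.matches


# Hypothetical markup: what a JavaScript-driven page might send before rendering
static_shell = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
# Hypothetical markup: what the browser shows after the script has run
rendered = '<html><body><table class="stat-table wide"><tbody><tr><td>1</td></tr></tbody></table></body></html>'

print(count_tables(static_shell))  # 0
print(count_tables(rendered))      # 1
```

If the count for the HTML you actually downloaded is 0, indexing with [0] will fail no matter which parser you use, and you need the data source behind the page instead.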

To find that URL, use the developer tools in Google Chrome.

1) Open the developer tools and click the "Network" tab.


2) Then refresh the page and you will see all the requests it makes.


3) Then filter by "XHR", and there you go! The request that returns the table data is listed there.


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from pprint import pprint

# URL of the internal API request, as shown under the "XHR" filter of the Network tab
url = 'http://www.nhl.com/stats/rest/grouped/skaters/basic/season/skatersummary?cayenneExp=seasonId=20162017 and gameTypeId=2&factCayenneExp=gamesPlayed>=1&sort=[{"property":"points","direction":"DESC"},{"property":"goals","direction":"DESC"},{"property":"assists","direction":"DESC"}]'

# Fetch the response and decode its JSON body
resp = requests.get(url).json()

# The player rows are under the 'data' key
pprint(resp['data'])
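
If you want the rows in a file rather than printed, here is a small sketch that turns a list of dicts like resp['data'] into CSV text with the standard csv module. I am assuming each record is a flat dict; the field names in the sample (playerName, goals, assists, points) and the sample rows are made up for illustration, so check the keys in the real response first:

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Return CSV text with one row per record, keeping only the given fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",   # skip keys that are not listed in fieldnames
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample standing in for resp['data']
sample = [
    {"playerName": "Player A", "goals": 30, "assists": 40, "points": 70, "team": "XYZ"},
    {"playerName": "Player B", "goals": 25, "assists": 35, "points": 60, "team": "ABC"},
]

print(records_to_csv(sample, ["playerName", "goals", "assists", "points"]))
```

To save it, write the returned string to a file opened with newline=''.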

Upvotes: 4
