wolfblitza

Reputation: 477

Python - Web scraping

I am new to Python and am trying to scrape data from the following site. This code worked for a different site, but I cannot get it to work for Next Gen Stats. Does anyone have any thoughts as to why? Below is my code and the error I am getting.

import pandas as pd
import numpy as np
import html5lib

urlwk1 = 'https://nextgenstats.nfl.com/stats/receiving/2020/1'
urlwk2 = 'https://nextgenstats.nfl.com/stats/receiving/2020/2'

df11 = pd.read_html(urlwk1)
df11[0].to_csv ('NFL_Receiving_Page1.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv

Below is the error I am getting

df11 = pd.read_html(urlwk1)
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\util_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 1101, in read_html
    displayed_only=displayed_only,
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 917, in _parse
    raise retained
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 898, in _parse
    tables = p.parse_tables()
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 217, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 547, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found
df11[0].to_csv ('NFL_Receiving_Page1.csv', index=False) #index false gets rid of index listing that appears as the very first column in the csv
Traceback (most recent call last):
  File "", line 1, in
NameError: name 'df11' is not defined

Upvotes: 1

Views: 315

Answers (2)

CodeIt

Reputation: 3618

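pandas.read_html cannot parse HTML tables that are loaded dynamically with JavaScript. It only sees the HTML that the server returns, and on this page that HTML contains no table, hence "No tables found".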
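One way to check for this situation is to count the <table> tags in the raw HTML before calling read_html. The sketch below uses only the standard library; the two sample pages and the names TableCounter and count_tables are made up for illustration and are not taken from the site.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "table":
            self.tables += 1


def count_tables(html):
    """Return the number of <table> elements found in the given HTML string."""
    counter = TableCounter()
    counter.feed(html)
    return counter.tables


# Hypothetical samples: a static page vs. an empty shell that JavaScript fills in later.
static_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
js_shell = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(count_tables(static_page))  # 1
print(count_tables(js_shell))     # 0
```

If count_tables returns 0 for the text downloaded from a URL, read_html has nothing to parse there, and the data has to come from somewhere else, such as an API or a rendered browser page.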

This page fetches the table data with an API call.

You can use the code below to fetch and parse the API response:

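import io

import requests
import pandas as pd

# Browser-like headers sent with the API request
headers = {
    'accept': 'application/json, text/plain, */*',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'referer': 'https://nextgenstats.nfl.com/',
    'accept-language': 'en-US,en;q=0.9,hi;q=0.8',
}

response = requests.get('https://appapi.ngs.nfl.com/statboard/receiving?season=2020&seasonType=REG&week=2', headers=headers)
response.raise_for_status()  # stop early on an HTTP error instead of parsing an error page

# Wrap the bytes in a file-like object; some newer pandas versions warn when raw JSON is passed directly
df = pd.read_json(io.BytesIO(response.content))
df.to_csv('NFL_Receiving_Page1.csv', index=False)  # index=False omits the index column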
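The question asks about both week 1 and week 2, so the same request can be repeated per week. The sketch below only builds the URLs and output file names; week_url and csv_name are hypothetical helpers, and it assumes that season, seasonType, and week are the only query parameters that need to change.

```python
from urllib.parse import urlencode

# Endpoint copied from the request above. Treating these three query
# parameters as sufficient is an assumption, not documented API behavior.
API_BASE = "https://appapi.ngs.nfl.com/statboard/receiving"


def week_url(season, week, season_type="REG"):
    """Build the API URL for one week of receiving stats."""
    query = urlencode({"season": season, "seasonType": season_type, "week": week})
    return f"{API_BASE}?{query}"


def csv_name(week):
    """Hypothetical naming scheme: one CSV file per week."""
    return f"NFL_Receiving_Week{week}.csv"


for week in (1, 2):
    print(week_url(2020, week), "->", csv_name(week))
```

Each URL can then be passed to requests.get with the same headers, and each resulting DataFrame written to the matching file name.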


Upvotes: 1

Ujjwal Agrawal

Reputation: 998

Load the page with a Selenium driver, then pass the rendered HTML to read_html

I think the page you mentioned loads its content dynamically. Please refer to the answer above, then try the code below.

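from io import StringIO

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import pandas as pd
import time

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chromedriver_path = '/home/user/chromedriver'  # adjust to your system

# Selenium 4 style; Selenium 3 used webdriver.Chrome(chromedriver_path, chrome_options=chrome_options)
d = webdriver.Chrome(service=Service(chromedriver_path), options=chrome_options)
try:
    d.get('https://nextgenstats.nfl.com/stats/receiving/2020/1')
    time.sleep(3)  # give the JavaScript time to render the table
    html = d.page_source
finally:
    d.quit()  # always close the browser

df = pd.read_html(StringIO(html))  # returns a list of DataFrames, one per table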

Once ChromeDriver is properly installed on your system, this code should work. Adjust the time.sleep() value to your internet speed, and set chromedriver_path to the driver's location on your system.
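A fixed sleep is either too short on a slow connection or wasted time on a fast one. An alternative is to poll until the table shows up. The sketch below is a generic standard-library helper, not part of Selenium; wait_until and table_present are hypothetical names, and the page is simulated by a counter so the example runs without a browser.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)


# Stand-in for a page whose table appears on the third check.
checks = {"count": 0}


def table_present():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(table_present, timeout=5, interval=0.01))  # True
```

With a real driver, the condition could be something like lambda: "<table" in d.page_source. Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui, which serves the same purpose.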

Upvotes: 0
