Reputation: 477
I am new to Python and am trying to scrape data from the following site. Although this code worked for a different site, I cannot get it to work for Next Gen Stats. Does anyone have any thoughts as to why? Below is my code and the error I am getting:
import pandas as pd
import numpy as np
import html5lib
urlwk1 = 'https://nextgenstats.nfl.com/stats/receiving/2020/1'
urlwk2 = 'https://nextgenstats.nfl.com/stats/receiving/2020/2'
df11 = pd.read_html(urlwk1)
df11[0].to_csv('NFL_Receiving_Page1.csv', index=False)  # index=False drops the index that would otherwise appear as the first column in the CSV
Below is the error I am getting
df11 = pd.read_html(urlwk1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\util\_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 1101, in read_html
    displayed_only=displayed_only,
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 917, in _parse
    raise retained
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 898, in _parse
    tables = p.parse_tables()
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 217, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "C:\Users\USERX\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 547, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found

df11[0].to_csv('NFL_Receiving_Page1.csv', index=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'df11' is not defined
Upvotes: 1
Views: 315
Reputation: 3618
pandas.read_html cannot parse tables that are loaded dynamically with JavaScript; it only sees the HTML the server initially returns, which contains no table element.
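You can reproduce the exact error from the question offline: when the HTML handed to read_html has no table element, pandas raises the same ValueError. A minimal illustration with toy HTML standing in for the server's response:

```python
import io
import pandas as pd

# The server-rendered HTML for a JS-driven page contains no <table>,
# so read_html raises ValueError — the same error seen in the question.
# (Toy HTML stands in for the real page; no network access needed.)
html_without_table = "<html><body><div>stats are injected by JavaScript</div></body></html>"

try:
    pd.read_html(io.StringIO(html_without_table))
except ValueError as err:
    print(err)  # -> No tables found
```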
This page fetches the table data through an API call, so you can use the code below to fetch and parse the API response directly instead:
import requests
import pandas as pd
headers = {
    'accept': 'application/json, text/plain, */*',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36',
    'referer': 'https://nextgenstats.nfl.com/',
    'accept-language': 'en-US,en;q=0.9,hi;q=0.8',
}
# The season, seasonType, and week query parameters select which stats to fetch
response = requests.get('https://appapi.ngs.nfl.com/statboard/receiving?season=2020&seasonType=REG&week=2', headers=headers)
df = pd.read_json(response.content)
df.to_csv('NFL_Receiving_Page1.csv', index=False)
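If read_json leaves the per-player rows nested, response.json() plus pandas.json_normalize can flatten them. A minimal sketch with a toy payload — the "stats" key and the field names here are assumptions about the response shape, so inspect response.json() to confirm the actual keys:

```python
import pandas as pd

# Toy payload mimicking the *assumed* shape of the API response;
# the real key and field names may differ — inspect response.json() first.
payload = {
    "season": 2020,
    "week": 2,
    "stats": [
        {"player": {"displayName": "Player A"}, "recYards": 110},
        {"player": {"displayName": "Player B"}, "recYards": 87},
    ],
}

# json_normalize flattens nested dicts into dotted column names,
# e.g. 'player.displayName'
df = pd.json_normalize(payload["stats"])
print(df.shape)  # -> (2, 2)
```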
Upvotes: 1
Reputation: 998
Render the page with a Selenium WebDriver, then read the HTML with pandas.
The page you mentioned loads its table dynamically, so read_html cannot see it in the raw response. Please refer to the answer above, or try the code below.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chromedriver_path = '/home/user/chromedriver'
d = webdriver.Chrome(chromedriver_path, options=chrome_options)  # 'options=' replaces the deprecated 'chrome_options=' keyword
d.get('https://nextgenstats.nfl.com/stats/receiving/2020/1')
time.sleep(3)  # crude wait for the JavaScript to render the table; increase on slow connections
html = d.page_source
df = pd.read_html(html)[0]  # read_html returns a list of DataFrames; take the first
d.quit()
Once ChromeDriver is properly installed on your system, this code will work. Adjust the time.sleep() delay to suit your internet speed, and set chromedriver_path to the driver's location on your machine.
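For reference, once the rendered page_source actually contains a table element, read_html succeeds — a minimal offline illustration with toy HTML standing in for a fully rendered page:

```python
import io
import pandas as pd

# Toy stand-in for a fully rendered page_source that contains a real <table>;
# the real Next Gen Stats markup will differ.
rendered = """
<table>
  <tr><th>Player</th><th>Yards</th></tr>
  <tr><td>Player A</td><td>110</td></tr>
  <tr><td>Player B</td><td>87</td></tr>
</table>
"""

# With a genuine <table> present, read_html finds it and uses the
# <th> row as the header
df = pd.read_html(io.StringIO(rendered))[0]
print(df.shape)  # -> (2, 2)
```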
Upvotes: 0