Reputation: 13
I am new to coding and need some assistance. I am trying to build a web scraper for a project that involves scraping NFL roster data from 2000 to 2023, but I am getting an error when requesting the HTML. I am using JupyterLab (Python-Pyodide) to write my code, and this is the only code I have:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from io import StringIO
years = list(range(2000, 2024))
url = 'https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023'
data = requests.get(url)
This is the error I'm getting:
(JsException: NetworkError: Failed to execute 'send' on 'XMLHttpRequest': Failed to load 'https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023'.)
Can you explain why I am getting this error and how I can fix it?
Upvotes: -2
Views: 74
Reputation: 557
You need to send headers with your get request, specifically a User-Agent. When you send this value, the request looks as if it comes from a browser (i.e. a real person) rather than a bot/scraper. You can find this value easily by Googling "what is my user agent". Copy that entire string; you will need it in a minute.
Declare a dict using the value you copied:
my_headers = {
    "User-Agent": "<YOUR_VALUE>"
}
Pass headers as an argument to the get method:
my_url = "https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023"
data = requests.get(url=my_url, headers=my_headers)
print(data.content) # just to confirm you got the response back
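As an aside, here is why the User-Agent matters: by default, requests identifies itself in its own User-Agent header, so servers can tell the traffic comes from a script rather than a browser. You can see the default value like this (a quick check, assuming the requests library is installed):

```python
import requests

# A fresh Session carries requests' default headers,
# including its tell-tale User-Agent string
session = requests.Session()
print(session.headers["User-Agent"])  # e.g. "python-requests/2.31.0"
```

Overriding that value with a browser-style string, as shown above, is what makes the request pass.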
Here is the scenic route to get your User-Agent and see what values are (or could be) in headers, if you're interested:
Upvotes: 0
Reputation: 1662
You didn't specify the request headers. Also, this page doesn't have table tags, so you can't use pd.read_html.
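You can confirm the lack of table tags quickly. The snippet below uses a made-up fragment of div-based markup (mimicking the site's layout, not copied from it) to show that BeautifulSoup finds no table element, which is why pd.read_html has nothing to parse:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment in the same div-based style as the roster page
html = '<div class="divtable"><div class="tr"><div class="td">Budda Baker</div></div></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('table'))  # None -> no <table> tags for pd.read_html to find
```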
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/2023"
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'
}
result = []
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
# The roster is rendered with styled divs, not a real <table>
table = soup.find('div', class_='divtable divtable-striped divtable-mobile')
table_head = [head.get_text() for head in table.find('div', class_='thead')]
# Remove the mobile-only labels duplicated inside each cell
for s in table.find_all('span', class_='visible-xs-inline'):
    s.extract()
for row in table.find_all('div', class_='tr'):
    result.append(dict(zip(table_head, [cell.get_text() for cell in row.find_all('div', class_='td')])))
df = pd.DataFrame(result)
print(df)
OUTPUT:
# Player Pos G GS Age College
0 82 Andre Baccellia WR 5 0 26 Washington
1 3 Budda Baker DB 12 12 27 Washington
2 96 Eric Banks DE 2 0 25 Texas-San Antonio
3 51 Krys Barnes LB 16 6 25 UCLA
4 66 Jackson Barton OT 1 0 28 Utah
.. .. ... .. .. .. .. ...
73 21 Garrett Williams DB 9 6 22 Syracuse
74 27 Divaad Wilson DB 2 1 23 Central Florida
75 20 Marco Wilson DB 15 11 24 Florida
76 14 Michael Wilson WR 13 12 23 Stanford
77 10 Josh Woods LB 11 7 27 Maryland
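Since the question covers 2000 to 2023, the parsing above can be wrapped in a helper and looped over seasons. This is a sketch: it assumes every season follows the same URL pattern as 2023, and it adds a delay between requests to be polite to the server:

```python
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36'
}

def roster_url(year):
    # Assumes other seasons follow the same URL pattern as 2023
    return f"https://www.footballdb.com/teams/nfl/arizona-cardinals/roster/{year}"

def scrape_roster(year):
    response = requests.get(roster_url(year), headers=HEADERS)
    soup = BeautifulSoup(response.text, 'lxml')
    table = soup.find('div', class_='divtable divtable-striped divtable-mobile')
    head = [h.get_text() for h in table.find('div', class_='thead')]
    for s in table.find_all('span', class_='visible-xs-inline'):
        s.extract()
    rows = [dict(zip(head, [c.get_text() for c in row.find_all('div', class_='td')]))
            for row in table.find_all('div', class_='tr')]
    df = pd.DataFrame(rows)
    df['Year'] = year  # keep track of which season each row came from
    return df

# frames = []
# for year in range(2000, 2024):
#     frames.append(scrape_roster(year))
#     time.sleep(1)  # pause between requests
# all_rosters = pd.concat(frames, ignore_index=True)
```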
Upvotes: 0