Reputation: 468
While parsing a URL with BeautifulSoup, I am running into this error:
Traceback (most recent call last):
  File "/Users/justinhudacsko/PycharmProjects/SportsBot/scrape.py", line 8, in <module>
    stats_page = BeautifulSoup(comment, "lxml")
  File "/usr/local/lib/python3.9/site-packages/bs4/__init__.py", line 310, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()
And my code is:
from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment
url = 'https://www.pro-football-reference.com/years/2020/draft.htm'
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
comment = soup.find(text=lambda text: isinstance(text, Comment) and 'class="table_outer_container"' in text) #THIS RETURNS NONE
stats_page = BeautifulSoup(comment, "lxml")
Why does the variable comment have None as its value, even though there are instances of class="table_outer_container" in this URL?
Upvotes: 1
Views: 942
Reputation: 28565
Keep Marc's answer, as it answers your question. However, I'd like to offer an alternative to using BeautifulSoup directly here. Pandas has the .read_html() function, which uses lxml or BeautifulSoup under the hood to parse tables. As long as the data is within a <table> tag, let pandas do the parsing for you. Then you just need to clean up the dataframe, as opposed to working out all the logic of iterating through <tr>, <th>, and <td> tags. In the code below, header=1 takes the second header row as the column names, .iloc[:,:-1] drops the last column, and the filter on 'Rnd' removes the header rows that are repeated inside the table body.
import pandas as pd
url = 'https://www.pro-football-reference.com/years/2020/draft.htm'
df = pd.read_html(url, header=1)[0].iloc[:,:-1]
df = df[~(df['Rnd'] == 'Rnd')].reset_index(drop=True)
Output:
print (df.head(35).to_string())
Rnd Pick Tm Player Pos Age To AP1 PB St CarAV DrAV G Cmp Att Yds TD Int Att.1 Yds.1 TD.1 Rec Yds.2 TD.2 Solo Int.1 Sk College/Univ
0 1 1 CIN Joe Burrow QB 23 2020 0 0 1 7 7 10 264 404 2688 13 5 37 142 3 0 0 0 NaN NaN NaN LSU
1 1 2 WAS Chase Young DE 21 2020 0 1 1 13 13 15 0 0 0 0 0 0 0 0 0 0 0 32 NaN 7.5 Ohio St.
2 1 3 DET Jeff Okudah CB 21 2020 0 0 0 2 2 9 0 0 0 0 0 0 0 0 0 0 0 41 1 NaN Ohio St.
3 1 4 NYG Andrew Thomas T 21 2020 0 0 1 6 6 16 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN Georgia
4 1 5 MIA Tua Tagovailoa QB 22 2020 0 0 0 5 5 10 186 290 1814 11 5 36 109 3 0 0 0 NaN NaN NaN Alabama
5 1 6 LAC Justin Herbert QB 22 2020 0 0 1 13 13 15 396 595 4336 31 10 55 234 5 0 0 0 NaN NaN NaN Oregon
6 1 7 CAR Derrick Brown DT 22 2020 0 0 1 7 7 16 0 0 0 0 0 0 0 0 0 0 0 21 NaN 2.0 Auburn
7 1 8 ARI Isaiah Simmons LB 22 2020 0 0 0 5 5 16 0 0 0 0 0 0 0 0 0 0 0 43 1 2.0 Clemson
8 1 9 JAX C.J. Henderson CB 21 2020 0 0 0 3 3 8 0 0 0 0 0 0 0 0 0 0 0 27 1 NaN Florida
9 1 10 CLE Jedrick Wills Jr. T 21 2020 0 0 1 8 8 15 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN Alabama
10 1 11 NYJ Mekhi Becton T 21 2020 0 0 1 5 5 14 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN Louisville
11 1 12 LVR Henry Ruggs III WR 21 2020 0 0 1 4 4 13 0 0 0 0 0 9 49 0 26 452 2 NaN NaN NaN Alabama
12 1 13 TAM Tristan Wirfs T 21 2020 0 0 1 11 11 16 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN Iowa
13 1 14 SFO Javon Kinlaw DT 22 2020 0 0 1 7 7 14 0 0 0 0 0 0 0 0 0 0 0 15 1 1.5 South Carolina
14 1 15 DEN Jerry Jeudy WR 21 2020 0 0 1 6 6 16 0 0 0 0 0 0 0 0 52 856 3 NaN NaN NaN Alabama
15 1 16 ATL AJ Terrell CB 21 2020 0 0 1 5 5 14 0 0 0 0 0 0 0 0 0 0 0 61 1 NaN Clemson
16 1 17 DAL CeeDee Lamb WR 21 2020 0 0 1 8 8 16 0 0 0 0 0 10 82 1 74 935 5 NaN NaN NaN Oklahoma
17 1 18 MIA Austin Jackson T 21 2020 0 0 1 6 6 13 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN USC
18 1 19 LVR Damon Arnette CB 24 2020 0 0 0 2 2 9 0 0 0 0 0 0 0 0 0 0 0 21 NaN NaN Ohio St.
19 1 20 JAX K'Lavon Chaisson DE 21 2020 0 0 0 2 2 16 0 0 0 0 0 0 0 0 0 0 0 12 NaN 1.0 LSU
20 1 21 PHI Jalen Reagor WR 21 2020 0 0 1 4 4 11 0 0 0 0 0 4 26 0 31 396 1 NaN NaN NaN TCU
21 1 22 MIN Justin Jefferson WR 21 2020 0 1 1 12 12 16 0 0 0 0 0 1 2 0 88 1400 7 NaN NaN NaN LSU
22 1 23 LAC Kenneth Murray LB 21 2020 0 0 1 8 8 16 0 0 0 0 0 0 0 0 0 0 0 68 NaN 1.0 Oklahoma
23 1 24 NOR Cesar Ruiz C 21 2020 0 0 0 5 5 15 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN Michigan
24 1 25 SFO Brandon Aiyuk WR 22 2020 0 0 1 6 6 12 0 0 0 0 0 6 77 2 60 748 5 NaN NaN NaN Arizona St.
25 1 26 GNB Jordan Love QB 21 NaN 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Utah St.
26 1 27 SEA Jordyn Brooks LB 22 2020 0 0 0 3 3 14 0 0 0 0 0 0 0 0 0 0 0 35 NaN NaN Texas Tech
27 1 28 BAL Patrick Queen LB 21 2020 0 0 1 10 10 16 0 0 0 0 0 0 0 0 0 0 0 66 1 3.0 LSU
28 1 29 TEN Isaiah Wilson T 21 2020 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN Georgia
29 1 30 MIA Noah Igbinoghene CB 20 2020 0 0 0 2 2 16 0 0 0 0 0 0 0 0 0 0 0 11 NaN NaN Auburn
30 1 31 MIN Jeff Gladney CB 23 2020 0 0 1 5 5 16 0 0 0 0 0 0 0 0 0 0 0 63 NaN NaN TCU
31 1 32 KAN Clyde Edwards-Helaire RB 21 2020 0 0 1 8 8 13 0 0 0 0 0 181 803 4 36 297 1 NaN NaN NaN LSU
32 2 33 CIN Tee Higgins WR 21 2020 0 0 1 6 6 16 0 0 0 0 0 5 28 0 67 908 6 NaN NaN NaN Clemson
33 2 34 IND Michael Pittman Jr. WR 22 2020 0 0 0 4 4 13 0 0 0 0 0 3 26 0 40 503 1 NaN NaN NaN USC
34 2 35 DET D'Andre Swift RB 21 2020 0 0 0 6 6 13 0 0 0 0 0 114 521 8 46 357 2 NaN NaN NaN Georgia
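To show what the clean-up line does without hitting the site, here is a minimal sketch on a hand-built frame. The frame `raw` is a hypothetical miniature of what read_html returns (only three columns, values typed in by hand); the repeated header row and the string-typed numbers are assumptions about how such a table typically arrives.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: the site repeats its header
# row inside the body, so one row holds the column names as plain strings.
raw = pd.DataFrame({
    "Rnd": ["1", "1", "Rnd", "2"],
    "Pick": ["1", "2", "Pick", "33"],
    "Player": ["Joe Burrow", "Chase Young", "Player", "Tee Higgins"],
})

# Drop every row whose 'Rnd' cell is the literal header text, then renumber.
clean = raw[raw["Rnd"] != "Rnd"].reset_index(drop=True)

# With the stray header rows gone, the numeric columns can be converted.
clean[["Rnd", "Pick"]] = clean[["Rnd", "Pick"]].apply(pd.to_numeric)

print(clean)
```

Converting to numeric after the filter is optional, but it probably saves trouble later if you want to sort or aggregate by round or pick.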
Upvotes: 1
Reputation: 501
The find call you're using only returns HTML comments that contain 'class="table_outer_container"'. If no comment on the page contains that string, find returns None, and passing None to BeautifulSoup raises the TypeError you are seeing. I assume you wanted the content of the element whose class is table_outer_container. You can get it as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.pro-football-reference.com/years/2020/draft.htm'
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
table = soup.find('div', class_='table_outer_container')
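The difference between the two searches can be sketched offline. The markup in `SAMPLE_HTML`, the class name `other_container`, and the helper `find_commented` are all made up for illustration; the built-in "html.parser" is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for a downloaded page: one table sits in the normal
# DOM, another is hidden inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<div class="table_outer_container"><table id="live"><tr><td>1</td></tr></table></div>
<!-- <div class="other_container"><table id="hidden"><tr><td>2</td></tr></table></div> -->
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def find_commented(soup, needle):
    """Return the first HTML comment containing `needle`, or None."""
    return soup.find(string=lambda s: isinstance(s, Comment) and needle in s)


# No comment mentions this class, so the comment search comes back empty.
missing = find_commented(soup, 'class="table_outer_container"')

# The element search finds the div, because it is part of the real DOM.
live_div = soup.find("div", class_="table_outer_container")

# A table that really is commented out needs a second parse of the comment.
# Check for None first, since BeautifulSoup(None, ...) raises a TypeError.
hidden_comment = find_commented(soup, 'class="other_container"')
hidden_table = None
if hidden_comment is not None:
    hidden_table = BeautifulSoup(hidden_comment, "html.parser").find("table")

print(missing, live_div.table["id"], hidden_table["id"])
```

Checking the result of find for None before handing it to BeautifulSoup is a cheap way to get a clear failure instead of the TypeError from the question.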
Upvotes: 3