earningjoker430

Reputation: 468

BeautifulSoup Error: TypeError: object of type 'NoneType' has no len()

While parsing a URL with BeautifulSoup, I run into this error:

Traceback (most recent call last):
  File "/Users/justinhudacsko/PycharmProjects/SportsBot/scrape.py", line 8, in <module>
    stats_page = BeautifulSoup(comment, "lxml")
  File "/usr/local/lib/python3.9/site-packages/bs4/__init__.py", line 310, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()

And my code is:

from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment

url = 'https://www.pro-football-reference.com/years/2020/draft.htm'
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
comment = soup.find(text=lambda text: isinstance(text, Comment) and 'class="table_outer_container"' in text) #THIS RETURNS NONE
stats_page = BeautifulSoup(comment, "lxml")

Why is the variable comment None, even though class="table_outer_container" appears on this page?

Upvotes: 1

Views: 942

Answers (2)

chitown88

Reputation: 28565

Keep Marc's answer, as it answers your question. However, I'd like to offer an alternative to using BeautifulSoup directly here. Pandas has the .read_html() function, which can use BeautifulSoup under the hood (depending on which parser it selects) to parse tables. As long as the data is within a <table> tag, you can let pandas do the parsing for you. Then you just need to clean up the DataFrame, as opposed to working out all the logic of iterating through <tr>, <th>, and <td> tags.

import pandas as pd

url = 'https://www.pro-football-reference.com/years/2020/draft.htm'
df = pd.read_html(url, header=1)[0].iloc[:,:-1]
df = df[~(df['Rnd'] == 'Rnd')].reset_index(drop=True)

Output:

print (df.head(35).to_string())
   Rnd Pick   Tm                 Player Pos Age    To AP1 PB St CarAV DrAV    G  Cmp  Att   Yds   TD  Int Att.1 Yds.1 TD.1  Rec Yds.2 TD.2 Solo Int.1   Sk    College/Univ
0    1    1  CIN             Joe Burrow  QB  23  2020   0  0  1     7    7   10  264  404  2688   13    5    37   142    3    0     0    0  NaN   NaN  NaN             LSU
1    1    2  WAS            Chase Young  DE  21  2020   0  1  1    13   13   15    0    0     0    0    0     0     0    0    0     0    0   32   NaN  7.5        Ohio St.
2    1    3  DET            Jeff Okudah  CB  21  2020   0  0  0     2    2    9    0    0     0    0    0     0     0    0    0     0    0   41     1  NaN        Ohio St.
3    1    4  NYG          Andrew Thomas   T  21  2020   0  0  1     6    6   16    0    0     0    0    0     0     0    0    0     0    0  NaN   NaN  NaN         Georgia
4    1    5  MIA         Tua Tagovailoa  QB  22  2020   0  0  0     5    5   10  186  290  1814   11    5    36   109    3    0     0    0  NaN   NaN  NaN         Alabama
5    1    6  LAC         Justin Herbert  QB  22  2020   0  0  1    13   13   15  396  595  4336   31   10    55   234    5    0     0    0  NaN   NaN  NaN          Oregon
6    1    7  CAR          Derrick Brown  DT  22  2020   0  0  1     7    7   16    0    0     0    0    0     0     0    0    0     0    0   21   NaN  2.0          Auburn
7    1    8  ARI         Isaiah Simmons  LB  22  2020   0  0  0     5    5   16    0    0     0    0    0     0     0    0    0     0    0   43     1  2.0         Clemson
8    1    9  JAX         C.J. Henderson  CB  21  2020   0  0  0     3    3    8    0    0     0    0    0     0     0    0    0     0    0   27     1  NaN         Florida
9    1   10  CLE      Jedrick Wills Jr.   T  21  2020   0  0  1     8    8   15    0    0     0    0    0     0     0    0    0     0    0  NaN   NaN  NaN         Alabama
10   1   11  NYJ           Mekhi Becton   T  21  2020   0  0  1     5    5   14    0    0     0    0    0     0     0    0    0     0    0  NaN   NaN  NaN      Louisville
11   1   12  LVR        Henry Ruggs III  WR  21  2020   0  0  1     4    4   13    0    0     0    0    0     9    49    0   26   452    2  NaN   NaN  NaN         Alabama
12   1   13  TAM          Tristan Wirfs   T  21  2020   0  0  1    11   11   16    0    0     0    0    0     0     0    0    0     0    0  NaN   NaN  NaN            Iowa
13   1   14  SFO           Javon Kinlaw  DT  22  2020   0  0  1     7    7   14    0    0     0    0    0     0     0    0    0     0    0   15     1  1.5  South Carolina
14   1   15  DEN            Jerry Jeudy  WR  21  2020   0  0  1     6    6   16    0    0     0    0    0     0     0    0   52   856    3  NaN   NaN  NaN         Alabama
15   1   16  ATL             AJ Terrell  CB  21  2020   0  0  1     5    5   14    0    0     0    0    0     0     0    0    0     0    0   61     1  NaN         Clemson
16   1   17  DAL            CeeDee Lamb  WR  21  2020   0  0  1     8    8   16    0    0     0    0    0    10    82    1   74   935    5  NaN   NaN  NaN        Oklahoma
17   1   18  MIA         Austin Jackson   T  21  2020   0  0  1     6    6   13    0    0     0    0    0     0     0    0    0     0    0  NaN   NaN  NaN             USC
18   1   19  LVR          Damon Arnette  CB  24  2020   0  0  0     2    2    9    0    0     0    0    0     0     0    0    0     0    0   21   NaN  NaN        Ohio St.
19   1   20  JAX       K'Lavon Chaisson  DE  21  2020   0  0  0     2    2   16    0    0     0    0    0     0     0    0    0     0    0   12   NaN  1.0             LSU
20   1   21  PHI           Jalen Reagor  WR  21  2020   0  0  1     4    4   11    0    0     0    0    0     4    26    0   31   396    1  NaN   NaN  NaN             TCU
21   1   22  MIN       Justin Jefferson  WR  21  2020   0  1  1    12   12   16    0    0     0    0    0     1     2    0   88  1400    7  NaN   NaN  NaN             LSU
22   1   23  LAC         Kenneth Murray  LB  21  2020   0  0  1     8    8   16    0    0     0    0    0     0     0    0    0     0    0   68   NaN  1.0        Oklahoma
23   1   24  NOR             Cesar Ruiz   C  21  2020   0  0  0     5    5   15    0    0     0    0    0     0     0    0    0     0    0  NaN   NaN  NaN        Michigan
24   1   25  SFO          Brandon Aiyuk  WR  22  2020   0  0  1     6    6   12    0    0     0    0    0     6    77    2   60   748    5  NaN   NaN  NaN     Arizona St.
25   1   26  GNB            Jordan Love  QB  21   NaN   0  0  0   NaN  NaN  NaN  NaN  NaN   NaN  NaN  NaN   NaN   NaN  NaN  NaN   NaN  NaN  NaN   NaN  NaN        Utah St.
26   1   27  SEA          Jordyn Brooks  LB  22  2020   0  0  0     3    3   14    0    0     0    0    0     0     0    0    0     0    0   35   NaN  NaN      Texas Tech
27   1   28  BAL          Patrick Queen  LB  21  2020   0  0  1    10   10   16    0    0     0    0    0     0     0    0    0     0    0   66     1  3.0             LSU
28   1   29  TEN          Isaiah Wilson   T  21  2020   0  0  0     0    0    1    0    0     0    0    0     0     0    0    0     0    0  NaN   NaN  NaN         Georgia
29   1   30  MIA       Noah Igbinoghene  CB  20  2020   0  0  0     2    2   16    0    0     0    0    0     0     0    0    0     0    0   11   NaN  NaN          Auburn
30   1   31  MIN           Jeff Gladney  CB  23  2020   0  0  1     5    5   16    0    0     0    0    0     0     0    0    0     0    0   63   NaN  NaN             TCU
31   1   32  KAN  Clyde Edwards-Helaire  RB  21  2020   0  0  1     8    8   13    0    0     0    0    0   181   803    4   36   297    1  NaN   NaN  NaN             LSU
32   2   33  CIN            Tee Higgins  WR  21  2020   0  0  1     6    6   16    0    0     0    0    0     5    28    0   67   908    6  NaN   NaN  NaN         Clemson
33   2   34  IND    Michael Pittman Jr.  WR  22  2020   0  0  0     4    4   13    0    0     0    0    0     3    26    0   40   503    1  NaN   NaN  NaN             USC
34   2   35  DET          D'Andre Swift  RB  21  2020   0  0  0     6    6   13    0    0     0    0    0   114   521    8   46   357    2  NaN   NaN  NaN         Georgia
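For reference, in the snippet above header=1 tells pandas to take the second row of the table as the column names, and .iloc[:, :-1] drops the last column. The filter on 'Rnd' is there presumably because the page repeats its header row partway through the table, so those repeats come back as ordinary data rows. Here is a minimal offline sketch of just that cleanup step. The small frame named raw is a made-up stand-in for the scraped table, and the numeric conversion at the end is an optional extra, not something the original snippet does.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: the header row repeats mid-table.
raw = pd.DataFrame(
    {
        "Rnd": ["1", "1", "Rnd", "2"],
        "Pick": ["1", "2", "Pick", "33"],
        "Player": ["Joe Burrow", "Chase Young", "Player", "Tee Higgins"],
    }
)

# Keep only rows whose 'Rnd' cell is not the literal header text,
# then renumber the index so it has no gaps.
clean = raw[raw["Rnd"] != "Rnd"].reset_index(drop=True)

# Optional extra (an assumption about what you want next): convert the
# columns that held stray header text from strings to numbers.
clean = clean.assign(
    Rnd=pd.to_numeric(clean["Rnd"]),
    Pick=pd.to_numeric(clean["Pick"]),
)

print(clean)
```

Converting after the filter matters: pd.to_numeric would raise on the leftover 'Rnd' and 'Pick' strings if the repeated header rows were still present.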

Upvotes: 1

Marc Dillar

Reputation: 501

The find call you're using only matches HTML comments whose text contains 'class="table_outer_container"'. If no comment matches, find returns None, and passing None to BeautifulSoup raises the TypeError you're seeing. I assume you wanted the content of the element whose class is table_outer_container.

You can do this as follows:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/years/2020/draft.htm'
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
table = soup.find('div', class_='table_outer_container')
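Whichever lookup you use, find returns None when nothing matches, so it is worth checking the result before handing it to BeautifulSoup again. Below is a rough offline sketch of both lookups with that guard in place. SAMPLE_HTML and the two helper functions are made up for illustration (they are not taken from the real page), and the sketch uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in markup: one container is a live element,
# the other is hidden inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<div class="table_outer_container" id="live"><table><tr><td>1</td></tr></table></div>
<!-- <div class="table_outer_container" id="hidden"><table><tr><td>2</td></tr></table></div> -->
</body></html>
"""


def find_container(soup):
    """Return the first live div with class table_outer_container, or None."""
    return soup.find("div", class_="table_outer_container")


def find_commented_container(soup):
    """Return the div parsed out of a matching HTML comment, or None."""
    comment = soup.find(
        string=lambda s: isinstance(s, Comment)
        and 'class="table_outer_container"' in s
    )
    if comment is None:
        # No matching comment: stop here instead of calling BeautifulSoup(None, ...).
        return None
    inner = BeautifulSoup(comment, "html.parser")
    return inner.find("div", class_="table_outer_container")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
live = find_container(soup)
hidden = find_commented_container(soup)
empty = find_commented_container(BeautifulSoup("<p>no tables</p>", "html.parser"))

print(live["id"], hidden["id"], empty)
```

The element lookup never sees markup that sits inside a comment, and the comment lookup never sees live elements, so which one you need depends on how the page delivers the table.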

Upvotes: 3
