88dlove

Reputation: 25

Unable to Write Table to Pandas DataFrame

I am attempting to scrape a webpage that has a table that is embedded in an HTML comment that gets loaded after some JavaScript executes. I am using requests_html and render to execute the JavaScript and obtain the full page including the table (which is actually the second table on the page), and that works well. The problem I run into is when I try to include that table in a Pandas DataFrame.

I have tried a couple of different approaches to get the data into a usable format. After rendering the webpage, I can iterate through the table and print either its HTML or just its text, but when I try to iterate through the table and insert the data into a Pandas DataFrame, it fails.

import pandas as pd
from requests_html import HTMLSession

url = 'https://www.sports-reference.com/cfb/schools/iowa/2018/gamelog/'

session = HTMLSession()

r = session.get(url)

r.html.render()

table = r.html.find('table#defense')

defensive_game_list = []

for d_stats in table:
     d_stats_sum = d_stats.find_all('td')
     d_game_sum = [d_stats.text for d_stats in d_stats_sum]
     defensive_game_list.append(d_game_sum)

df_defense = pd.DataFrame(defensive_game_list)

When I run the code, I receive the following error as soon as the loop runs:

Traceback (most recent call last):
  File "", line 2, in
AttributeError: 'Element' object has no attribute 'find_all'

What I am hoping to do is put the text of the table into an empty list and then put that list into the DataFrame.

Any help would be greatly appreciated. Thanks!

Upvotes: 1

Views: 145

Answers (1)

Bitto

Reputation: 8215

I would like to mention two points.

a) The table you want is already present in the HTML. It is just commented out. If you want, you can avoid using requests-html and just use requests.

b) You can use read_html to get a DataFrame directly from an html table.
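
Point (a) can be checked offline with only the standard library. This is a minimal sketch, assuming a made-up HTML snippet rather than the real page: SAMPLE_HTML and CommentCollector are hypothetical names used for illustration. It shows that a commented-out table is still part of the raw HTML text and can be pulled out without running any JavaScript.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the real page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_defense">
<!--
<table id="defense">
  <tr><th>Rk</th><th>Date</th></tr>
  <tr><td>1</td><td>2018-09-01</td></tr>
</table>
-->
</div>
"""


class CommentCollector(HTMLParser):
    """Collect the raw text of every HTML comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # 'data' is everything between '<!--' and '-->'
        self.comments.append(data)


collector = CommentCollector()
collector.feed(SAMPLE_HTML)

# Keep only the comments that contain table markup.
table_comments = [c for c in collector.comments if "<table" in c]
print(len(table_comments))
```

The string in table_comments[0] is ordinary table markup, which is the kind of input that point (b) hands to read_html.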

Here I am just getting the comment and converting it into a DataFrame:

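from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url = 'https://www.sports-reference.com/cfb/schools/iowa/2018/gamelog/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

# The defense table is wrapped in an HTML comment inside div#all_defense
d_table = soup.find('div', id='all_defense').find(string=lambda text: isinstance(text, Comment))

# read_html returns a list of DataFrames. Recent pandas versions deprecate
# passing a literal HTML string, so the markup is wrapped in StringIO.
df = pd.read_html(StringIO(str(d_table)))
print(df)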

Output

[   Unnamed: 0_level_0     Passing            Rushing         ...         Unnamed: 23_level_0 Unnamed: 24_level_0 Unnamed: 25_level_0
                   Rk        Date Unnamed: 2_level_1         ...                         Fum                 Int                  TO
0                 1.0  2018-09-01                NaN         ...                           1                   1                   2
1                 2.0  2018-09-08                NaN         ...                           1                   0                   1
2                 3.0  2018-09-15                NaN         ...                           1                   1                   2
3                 4.0  2018-09-22                NaN         ...                           0                   0                   0
4                 5.0  2018-10-06                  @         ...                           0                   4                   4
5                 6.0  2018-10-13                  @         ...                           0                   2                   2
6                 7.0  2018-10-20                NaN         ...                           1                   1                   2
7                 8.0  2018-10-27                  @         ...                           1                   1                   2
8                 9.0  2018-11-03                  @         ...                           0                   2                   2
9                10.0  2018-11-10                NaN         ...                           0                   2                   2
10               11.0  2018-11-17                  @         ...                           1                   3                   4
11               12.0  2018-11-23                NaN         ...                           0                   1                   1
12               13.0  2019-01-01                  N         ...                           1                   2                   3
13                NaN    13 Games                NaN         ...                           7                  20                  27

[14 rows x 26 columns]]
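
The output above is a list holding one DataFrame with a two-level header, and its last row is a totals row ("13 Games") with no Rk value. A possible cleanup is sketched below. It assumes a small hand-built frame in place of the real result: the "Yds" columns and all numbers are made up for illustration, and flatten is a hypothetical helper name.

```python
import pandas as pd

# Made-up miniature of the read_html result: a list containing one DataFrame
# with two header levels. Column names and values are illustrative only.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Date"),
    ("Passing", "Yds"),
    ("Rushing", "Yds"),
])
dfs = [pd.DataFrame(
    [
        [1.0, "2018-09-01", 200, 100],
        [float("nan"), "13 Games", 200, 100],  # totals row has no Rk
    ],
    columns=columns,
)]

# read_html returns a list, so pick the table out of it first.
df = dfs[0]


def flatten(col):
    """Collapse a (top, bottom) header pair into a single column name."""
    top, bottom = col
    # Placeholder top-level names carry no information, so drop them.
    return bottom if top.startswith("Unnamed") else f"{top}_{bottom}"


df.columns = [flatten(c) for c in df.columns]

# Drop the totals row, identified here by its missing Rk value.
df = df[df["Rk"].notna()]
print(df)
```

Prefixing with the top-level name keeps columns such as the passing and rushing yards distinct after flattening.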

Upvotes: 1
