Reputation: 25
I am trying to scrape a webpage whose table is embedded in an HTML comment and only becomes part of the page after some JavaScript executes. I am using requests_html and render() to execute the JavaScript and obtain the full page, including the table (which is actually the second table on the page), and that works well. The problem arises when I try to load that table into a Pandas DataFrame.
I have tried a couple of different ways to get the data into a usable format. After rendering the page I can iterate through the table and print its HTML as well as its text, but when I try to iterate through the table and insert the data into a Pandas DataFrame, it fails.
from requests_html import HTMLSession
import pandas as pd

url = 'https://www.sports-reference.com/cfb/schools/iowa/2018/gamelog/'
session = HTMLSession()
r = session.get(url)
r.html.render()  # execute the page's JavaScript
table = r.html.find('table#defense')
defensive_game_list = []
for d_stats in table:
    # This is the line that raises the AttributeError reported in this question
    d_stats_sum = d_stats.find_all('td')
    d_game_sum = [cell.text for cell in d_stats_sum]
    defensive_game_list.append(d_game_sum)
df_defense = pd.DataFrame(defensive_game_list)
When I run the code, I receive the following error inside the loop:
Traceback (most recent call last):
  File "", line 2, in
AttributeError: 'Element' object has no attribute 'find_all'
What I want it to do is put the text of the table into a list and then put that list into the DataFrame.
Any help would be greatly appreciated. Thanks!
Upvotes: 1
Views: 145
Reputation: 8215
I would like to mention two points.
a) The table you want is already present in the HTML; it is just commented out. If you want, you can avoid requests-html and use plain requests.
b) You can use pandas' read_html to get a DataFrame directly from an HTML table.
Here I am just getting the comment and converting it into a DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

url = 'https://www.sports-reference.com/cfb/schools/iowa/2018/gamelog/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# The defense table is stored as an HTML comment inside div#all_defense
d_table = soup.find('div', id='all_defense').find(string=lambda text: isinstance(text, Comment))
# read_html returns a list of DataFrames, one per <table> it finds
df = pd.read_html(d_table)
print(df)
Output
[ Unnamed: 0_level_0 Passing Rushing ... Unnamed: 23_level_0 Unnamed: 24_level_0 Unnamed: 25_level_0
Rk Date Unnamed: 2_level_1 ... Fum Int TO
0 1.0 2018-09-01 NaN ... 1 1 2
1 2.0 2018-09-08 NaN ... 1 0 1
2 3.0 2018-09-15 NaN ... 1 1 2
3 4.0 2018-09-22 NaN ... 0 0 0
4 5.0 2018-10-06 @ ... 0 4 4
5 6.0 2018-10-13 @ ... 0 2 2
6 7.0 2018-10-20 NaN ... 1 1 2
7 8.0 2018-10-27 @ ... 1 1 2
8 9.0 2018-11-03 @ ... 0 2 2
9 10.0 2018-11-10 NaN ... 0 2 2
10 11.0 2018-11-17 @ ... 1 3 4
11 12.0 2018-11-23 NaN ... 0 1 1
12 13.0 2019-01-01 N ... 1 2 3
13 NaN 13 Games NaN ... 7 20 27
[14 rows x 26 columns]]
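The same idea can be tried offline on a small inline snippet. This is only a minimal sketch: SAMPLE_HTML and the helper name commented_table_to_frame are made up for illustration and are not part of the real page or of any library. It assumes pandas, beautifulsoup4 and lxml are installed (read_html needs an HTML parser backend). The comment text is wrapped in StringIO because recent pandas versions prefer a file-like object over a literal HTML string, and the first element of the returned list is taken to get a single DataFrame.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: a table hidden inside an HTML comment.
SAMPLE_HTML = """
<div id="all_defense">
<!--
<table id="defense">
  <thead><tr><th>Rk</th><th>Date</th><th>TO</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>2018-09-01</td><td>2</td></tr>
    <tr><td>2</td><td>2018-09-08</td><td>1</td></tr>
  </tbody>
</table>
-->
</div>
"""


def commented_table_to_frame(html, div_id):
    """Return the first table found inside the first HTML comment of a div."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", id=div_id)
    # Pick the first child string that is a Comment node
    comment = wrapper.find(string=lambda text: isinstance(text, Comment))
    # read_html returns a list of DataFrames; this sketch keeps only the first
    tables = pd.read_html(StringIO(str(comment)))
    return tables[0]


df_defense = commented_table_to_frame(SAMPLE_HTML, "all_defense")
print(df_defense)
```

For the real page, r.text from requests.get(url) would take the place of SAMPLE_HTML, and the result keeps the two-level column header visible in the output above.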
Upvotes: 1