Reputation: 47
I'm trying to load this table into a DataFrame, but I can't figure out how. So far I've combined what I learned in class with an answer that was posted here on this forum, but I still can't get it to work. Can anyone help and explain what they did? My code is below:
import requests
import pandas
from bs4 import BeautifulSoup
page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", attrs={"class":"sortable stats_table now_sortable"})
table_rows = table.find_all('tr')
l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    l.append(row)
#test columns
df = pd.DataFrame(l, columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"])
print(df)
Upvotes: 0
Views: 523
Reputation: 4625
My solution:
import pandas as pd
df = pd.read_html("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")[1]
# Generate a list of the new columns
new_columns = [chr(x) for x in range(ord('A'), ord('O')+1)]
columns = dict(zip(df.columns, new_columns))
df.rename(columns=columns, inplace=True)
print(df)
A B C D E F G H I J K L M N O
0 1 Sat, Nov 28, 2020 2:00p REG NaN Coppin State MEAC W 81.0 71.0 NaN 1.0 0.0 W 1 Cameron Indoor Stadium
1 2 Tue, Dec 1, 2020 7:30p REG NaN Michigan State (8) Big Ten L 69.0 75.0 NaN 1.0 1.0 L 1 Cameron Indoor Stadium
2 3 Fri, Dec 4, 2020 7:00p REG NaN Bellarmine A-Sun W 76.0 54.0 NaN 2.0 1.0 W 1 Cameron Indoor Stadium
3 4 Tue, Dec 8, 2020 9:30p REG NaN Illinois (6) Big Ten L 68.0 83.0 NaN 2.0 2.0 L 1 Cameron Indoor Stadium
4 5 Wed, Dec 16, 2020 9:00p REG @ Notre Dame ACC W 75.0 65.0 NaN 3.0 2.0 W 1 Purcell Pavilion at the Joyce Center
5 6 Wed, Jan 6, 2021 8:30p REG NaN Boston College ACC W 83.0 82.0 NaN 4.0 2.0 W 2 Cameron Indoor Stadium
6 7 Sat, Jan 9, 2021 12:00p REG NaN Wake Forest ACC W 79.0 68.0 NaN 5.0 2.0 W 3 Cameron Indoor Stadium
7 8 Tue, Jan 12, 2021 7:00p REG @ Virginia Tech (20) ACC L 67.0 74.0 NaN 5.0 3.0 L 1 Cassell Coliseum
8 9 Tue, Jan 19, 2021 9:00p REG @ Pittsburgh ACC L 73.0 79.0 NaN 5.0 4.0 L 2 Petersen Events Center
9 10 Sat, Jan 23, 2021 4:00p REG @ Louisville ACC L 65.0 70.0 NaN 5.0 5.0 L 3 KFC Yum! Center
10 11 Tue, Jan 26, 2021 9:00p REG NaN Georgia Tech ACC NaN NaN NaN NaN NaN NaN NaN NaN
11 12 Sat, Jan 30, 2021 12:00p REG NaN Clemson (20) ACC NaN NaN NaN NaN NaN NaN NaN NaN
12 13 Mon, Feb 1, 2021 7:00p REG @ Miami (FL) ACC NaN NaN NaN NaN NaN NaN NaN NaN
13 14 Sat, Feb 6, 2021 6:00p REG NaN North Carolina ACC NaN NaN NaN NaN NaN NaN NaN NaN
14 15 Tue, Feb 9, 2021 4:00p REG NaN Notre Dame ACC NaN NaN NaN NaN NaN NaN NaN NaN
15 16 Sat, Feb 13, 2021 4:00p REG @ North Carolina State ACC NaN NaN NaN NaN NaN NaN NaN NaN
16 17 Wed, Feb 17, 2021 8:30p REG @ Wake Forest ACC NaN NaN NaN NaN NaN NaN NaN NaN
17 18 Sat, Feb 20, 2021 NaN REG NaN Virginia (13) ACC NaN NaN NaN NaN NaN NaN NaN NaN
18 19 Mon, Feb 22, 2021 7:00p REG NaN Syracuse ACC NaN NaN NaN NaN NaN NaN NaN NaN
19 20 Sat, Feb 27, 2021 6:00p REG NaN Louisville ACC NaN NaN NaN NaN NaN NaN NaN NaN
20 21 Tue, Mar 2, 2021 7:00p REG @ Georgia Tech ACC NaN NaN NaN NaN NaN NaN NaN NaN
21 22 Sat, Mar 6, 2021 6:00p REG @ North Carolina ACC NaN NaN NaN NaN NaN NaN NaN NaN
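The renaming step can be tried without fetching the page. Below is a minimal sketch on a small hand-made frame; the original column names (`G`, `Date`, `Opponent`) are assumptions for illustration, not the real headers. It derives the letters from the number of columns instead of hard-coding `'A'` to `'O'`, which works for up to 26 columns.

```python
import string

import pandas as pd

# Hypothetical stand-in for the scraped table; the column names are made up.
df = pd.DataFrame(
    [
        [1, "Sat, Nov 28, 2020", "Coppin State"],
        [2, "Tue, Dec 1, 2020", "Michigan State (8)"],
    ],
    columns=["G", "Date", "Opponent"],
)

# One uppercase letter per existing column: A, B, C, ...
new_columns = list(string.ascii_uppercase[: len(df.columns)])

# Without inplace=True, rename returns a new frame and leaves df untouched.
renamed = df.rename(columns=dict(zip(df.columns, new_columns)))
print(list(renamed.columns))  # ['A', 'B', 'C']
```

If every column is being replaced anyway, assigning `df.columns = new_columns` directly should have the same effect.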
Upvotes: 3
Reputation: 5648
You have a few things wrong here, including what @Ferris mentioned. This will get you started:
import requests
import pandas as pd  # import as pd, since the code calls pd.DataFrame
from bs4 import BeautifulSoup
page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page.text, "html.parser")  # pass page.text, not the Response object
# table = soup.find("table", attrs={"class":"sortable stats_table"})
table = soup.find("table", attrs={"id":"schedule"})  # use the id if available; couldn't get class to work when there is a space in the class name
table_rows = table.find_all('tr')
# the loop below works as you have it, but the result doesn't read into the DataFrame correctly
l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    l.append(row)
# test columns
# read without columns to see what you have
df = pd.DataFrame(l)
# df = pd.DataFrame(l, columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"])
print(df)
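On the class lookup: BeautifulSoup treats `class` as a multi-valued attribute, so a string containing spaces only matches when it equals the attribute value exactly. A likely reason the original lookup fails (an assumption, I have not checked the raw HTML) is that `now_sortable` is added by JavaScript in the browser and is missing from what `requests` downloads. A minimal sketch on a made-up snippet, where the markup is illustrative only:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page (not the real HTML).
html = """
<table class="sortable stats_table" id="schedule">
  <tr><th>G</th><th>Opponent</th></tr>
  <tr><td>1</td><td>Coppin State</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the whole class attribute: no match, because
# the third class is not present in this markup.
by_full_string = soup.find(
    "table", attrs={"class": "sortable stats_table now_sortable"}
)

# Matching a single class works regardless of which other classes are set.
by_one_class = soup.find("table", class_="stats_table")

# A CSS selector requires both classes, in any order.
by_selector = soup.select_one("table.sortable.stats_table")

print(by_full_string)      # None
print(by_one_class["id"])  # schedule
print(by_selector["id"])   # schedule
```

Looking the table up by `id`, as above, avoids the question entirely when an id is available.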
Upvotes: 2
Reputation: 357
There are many ways to do this. This is probably not the best one, but it gets the job done.
import requests
import pandas as pd
from bs4 import BeautifulSoup
page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page.content, "html.parser")
table_header = soup.find_all("thead")[1]
table_header_rows = table_header.find_all('tr')
table_header_text = []
for tr in table_header_rows:
    th = tr.find_all('th')
    row = [tr.text for tr in th]
    table_header_text.append(row)
table_body = soup.find_all("tbody")[1]
table_body_rows = table_body.find_all('tr')
table_body_text = []
for tr in table_body_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    table_body_text.append(row)
# drop the first header label, since the td-only rows have one cell fewer
df = pd.DataFrame(table_body_text, columns=table_header_text[0][1:])
print(df)
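One thing to watch with `td`-only loops: a row made up solely of `th` cells yields an empty list, and the `[1:]` slice is needed presumably because the first cell of each body row is a `th`. A minimal sketch that skips rows without data cells and reads both `th` and `td` cells so the header lines up without slicing. The HTML is a made-up stand-in, and the header labels and the repeated header row are assumptions, not the real page:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: the first cell of each body row is a <th>,
# and a header row is repeated inside <tbody>.
html = """
<table id="schedule">
  <thead><tr><th>G</th><th>Opponent</th><th>Tm</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Coppin State</td><td>81</td></tr>
    <tr class="thead"><th>G</th><th>Opponent</th><th>Tm</th></tr>
    <tr><th>2</th><td>Michigan State (8)</td><td>69</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"id": "schedule"})

# Column labels come from the <thead> cells.
header = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

rows = []
for tr in table.find("tbody").find_all("tr"):
    if not tr.find("td"):
        continue  # no data cells (e.g. a repeated header row): skip it
    # Read <th> and <td> cells together, in document order.
    rows.append([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])

df = pd.DataFrame(rows, columns=header)
print(df)
```

All values come back as strings with this approach, so numeric columns would still need converting, for example with `pd.to_numeric`.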
Upvotes: 1