Python Webscraping a table into a Dataframe

I'm trying to take the table on this page and load it into a DataFrame, but I can't figure out how. So far I've combined what I learned in class with an answer that was posted on this forum, and I still can't get it to work. Can anyone help and explain what they did? My code is below:

import requests
import pandas
from bs4 import BeautifulSoup

page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", attrs={"class":"sortable stats_table now_sortable"})
table_rows = table.find_all('tr')

l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    l.append(row)
#test columns
df = pd.DataFrame(l, columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"])
print(df)

Upvotes: 0

Answers (3)

yoonghm

Reputation: 4625

My solution uses pandas.read_html, which returns a list of DataFrames, one per table it finds on the page:

import pandas as pd

df = pd.read_html("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")[1]

# Generate a list of the new columns
new_columns = [chr(x) for x in range(ord('A'), ord('O')+1)]
columns = dict(zip(df.columns, new_columns))
df.rename(columns=columns, inplace=True)
print(df)

     A                  B       C    D    E                     F        G    H     I     J   K    L    M    N                                     O
0    1  Sat, Nov 28, 2020   2:00p  REG  NaN          Coppin State     MEAC    W  81.0  71.0 NaN  1.0  0.0  W 1                Cameron Indoor Stadium
1    2   Tue, Dec 1, 2020   7:30p  REG  NaN    Michigan State (8)  Big Ten    L  69.0  75.0 NaN  1.0  1.0  L 1                Cameron Indoor Stadium
2    3   Fri, Dec 4, 2020   7:00p  REG  NaN            Bellarmine    A-Sun    W  76.0  54.0 NaN  2.0  1.0  W 1                Cameron Indoor Stadium
3    4   Tue, Dec 8, 2020   9:30p  REG  NaN          Illinois (6)  Big Ten    L  68.0  83.0 NaN  2.0  2.0  L 1                Cameron Indoor Stadium
4    5  Wed, Dec 16, 2020   9:00p  REG    @            Notre Dame      ACC    W  75.0  65.0 NaN  3.0  2.0  W 1  Purcell Pavilion at the Joyce Center
5    6   Wed, Jan 6, 2021   8:30p  REG  NaN        Boston College      ACC    W  83.0  82.0 NaN  4.0  2.0  W 2                Cameron Indoor Stadium
6    7   Sat, Jan 9, 2021  12:00p  REG  NaN           Wake Forest      ACC    W  79.0  68.0 NaN  5.0  2.0  W 3                Cameron Indoor Stadium
7    8  Tue, Jan 12, 2021   7:00p  REG    @    Virginia Tech (20)      ACC    L  67.0  74.0 NaN  5.0  3.0  L 1                      Cassell Coliseum
8    9  Tue, Jan 19, 2021   9:00p  REG    @            Pittsburgh      ACC    L  73.0  79.0 NaN  5.0  4.0  L 2                Petersen Events Center
9   10  Sat, Jan 23, 2021   4:00p  REG    @            Louisville      ACC    L  65.0  70.0 NaN  5.0  5.0  L 3                       KFC Yum! Center
10  11  Tue, Jan 26, 2021   9:00p  REG  NaN          Georgia Tech      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
11  12  Sat, Jan 30, 2021  12:00p  REG  NaN          Clemson (20)      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
12  13   Mon, Feb 1, 2021   7:00p  REG    @            Miami (FL)      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
13  14   Sat, Feb 6, 2021   6:00p  REG  NaN        North Carolina      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
14  15   Tue, Feb 9, 2021   4:00p  REG  NaN            Notre Dame      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
15  16  Sat, Feb 13, 2021   4:00p  REG    @  North Carolina State      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
16  17  Wed, Feb 17, 2021   8:30p  REG    @           Wake Forest      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
17  18  Sat, Feb 20, 2021     NaN  REG  NaN         Virginia (13)      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
18  19  Mon, Feb 22, 2021   7:00p  REG  NaN              Syracuse      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
19  20  Sat, Feb 27, 2021   6:00p  REG  NaN            Louisville      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
20  21   Tue, Mar 2, 2021   7:00p  REG    @          Georgia Tech      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
21  22   Sat, Mar 6, 2021   6:00p  REG    @        North Carolina      ACC  NaN   NaN   NaN NaN  NaN  NaN  NaN                                   NaN
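
If the goal is only to relabel every column, assigning to df.columns directly avoids building a rename mapping. A minimal sketch, using a small made-up frame in place of the scraped table (the sample data and its column names are assumptions, not the real schedule):

```python
import string

import pandas as pd

# Hypothetical stand-in for the scraped table; the real one has 15 columns.
df = pd.DataFrame(
    [[1, "Sat, Nov 28, 2020", "Coppin State"],
     [2, "Tue, Dec 1, 2020", "Michigan State (8)"]],
    columns=["G", "Date", "Opponent"],
)

# Take as many capital letters as there are columns: 'A', 'B', 'C', ...
df.columns = list(string.ascii_uppercase[:len(df.columns)])
print(df)
```

This only covers up to 26 columns. It also behaves predictably when two columns share a name, whereas a dict built from the old names keeps just one entry per name.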

Upvotes: 3

Jonathan Leon

Reputation: 5648

You have a few things wrong here, including what @Ferris mentioned. This will get you started:

import requests
import pandas as pd  # import as pd, since the code below calls pd.DataFrame
from bs4 import BeautifulSoup

page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page.text, "html.parser")  # pass page.text, not the Response object
# table = soup.find("table", attrs={"class":"sortable stats_table"})
table = soup.find("table", attrs={"id":"schedule"})  # use the id if available; couldn't get class to work when space is in class name
table_rows = table.find_all('tr')

# the loop below works as you have it, but the result doesn't read into the dataframe correctly
l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    l.append(row)
#test columns
# read without columns to see what you have
df = pd.DataFrame(l)
# df = pd.DataFrame(l, columns=["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"])
print(df)
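
On the class lookup mentioned in the comment: when Beautiful Soup is given a class string that contains spaces, it compares it against the exact value of the class attribute, so one extra or missing class name makes the search return None. The now_sortable class may also be added by JavaScript in the browser and so be absent from the HTML that requests downloads; that is an assumption, not something verified against the page. A sketch on a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page (an assumption).
html = """
<table class="sortable stats_table" id="schedule">
  <tr><td>Coppin State</td><td>W</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match: fails because the attribute lacks "now_sortable".
exact = soup.find("table", attrs={"class": "sortable stats_table now_sortable"})

# A single class name matches any table that carries that class.
by_class = soup.find("table", class_="stats_table")

# A CSS selector requires both classes, in any order.
by_selector = soup.select_one("table.sortable.stats_table")

print(exact is None, by_class["id"], by_selector["id"])
```

Selecting by id, as the answer does, remains the simplest option when the table has one.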

Upvotes: 2

PSK

Reputation: 357

There are many ways to do the same thing. This is probably not the best way, but it gets the work done.

import requests
import pandas as pd
from bs4 import BeautifulSoup

page = requests.get("https://www.sports-reference.com/cbb/schools/duke/2021-schedule.html")
soup = BeautifulSoup(page.content, "html.parser")

table_header = soup.find_all("thead")[1]
table_header_rows = table_header.find_all('tr')
table_header_text = []
for tr in table_header_rows:
    th = tr.find_all('th')
    row = [tr.text for tr in th]
    table_header_text.append(row)

table_body = soup.find_all("tbody")[1]
table_body_rows = table_body.find_all('tr')
table_body_text = []
for tr in table_body_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    table_body_text.append(row)
    
df = pd.DataFrame(table_body_text, columns=table_header_text[0][1:])
print(df)
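
The [1:] slice on the header is presumably needed because each body row starts with a th cell (the game number) that find_all('td') skips. Collecting both tag types keeps header and body aligned without slicing. A sketch on a made-up snippet, so the markup here is an assumption about the page's layout:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML: each body row begins with a <th> cell, as assumed above.
html = """
<table id="schedule">
  <thead><tr><th>G</th><th>Date</th><th>Opponent</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Sat, Nov 28, 2020</td><td>Coppin State</td></tr>
    <tr><th>2</th><td>Tue, Dec 1, 2020</td><td>Michigan State (8)</td></tr>
  </tbody>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", id="schedule")

# Header names come from the first header row.
header = [th.get_text(strip=True) for th in table.thead.find("tr").find_all("th")]

# Passing a list of tag names returns <th> and <td> cells in document order.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.tbody.find_all("tr")
]

df = pd.DataFrame(rows, columns=header)
print(df)
```

Every value comes back as a string here, so numeric columns would still need converting afterwards.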

Upvotes: 1
