Reputation: 479
new to Python so this might be a basic question but I have the following table:
https://www.sports-reference.com/cfb/years/1991-passing.html
And I'd like to scrape it with BeautifulSoup
to have an output like this:
Player | School | Conf | all the way to TD under Rushing |
---|---|---|---|
Ty Detmer | Brigham Young | WAC | 7 |
Player Two | School Two | Conf 2 | 5 |
Problem 1: If you look at the URL above, every 21st row is a header row that should be ignored
Problem 2: "Rushing" seems to be another th
so my code and output below currently are like this:
import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup
data_universe = {}
years = list(range(1990,1991))
COLUMNS = ['Player', 'School', 'Conf', 'G', 'Cmp', 'Att', 'Pct', 'Yds', 'Y/A', 'AY/A', 'TD', 'Int', 'Rate', 'Rush_Att', 'Rush_Yds', 'Avg', 'Rush_TD']
for year in years:
url = 'https://www.sports-reference.com/cfb/years/%s-passing.html' %year
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
parsed_table = soup.find_all('table')[0]
rows = parsed_table.find_all("tr")
cy_data = []
for row in rows[2:]:
cells = row.find_all("td")
cells = cells[0:18]
cy_data.append([cell.text for cell in cells]) # For each "td" tag, get the text inside it
cy_data = pd.DataFrame(cy_data, columns=COLUMNS)
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(cy_data.head())
Output:
How can I format this table neatly in the dataframe like it is on the site?
Upvotes: 0
Views: 417
Reputation: 9619
You can load a html table directly into pandas using read_html
, no need to use BeautifulSoup. You can then process the dataframe by removing the top header row and the mid-table header rows:
df = pd.read_html('https://www.sports-reference.com/cfb/years/1991-passing.html')[0]
df.columns = df.columns.droplevel(0) # drop top header row
df = df[df['Rk'].ne('Rk')] # remove mid-table header rows
Output:
| | Rk | Player | School | Conf | G | Cmp | Att | Pct | Yds | Y/A | AY/A | TD | Int | Rate | Att | Yds | Avg | TD |
|---:|-----:|:---------------|:--------------|:---------|----:|------:|------:|------:|------:|------:|-------:|-----:|------:|-------:|------:|------:|------:|-----:|
| 0 | 1 | Ty Detmer | Brigham Young | WAC | 12 | 249 | 403 | 61.8 | 4031 | 10 | 10.4 | 35 | 12 | 168.5 | 75 | -30 | -0.4 | 4 |
| 1 | 2 | Rick Mirer | Notre Dame | Ind | 12 | 132 | 234 | 56.4 | 2117 | 9 | 8.7 | 18 | 10 | 149.2 | 75 | 306 | 4.1 | 9 |
| 2 | 3 | J.J. Joe | Baylor | SWC | 11 | 109 | 206 | 52.9 | 1853 | 9 | 7.9 | 7 | 8 | 131.9 | 116 | 147 | 1.3 | 6 |
| 3 | 4 | Shane Matthews | Florida | SEC | 11 | 218 | 361 | 60.4 | 3130 | 8.7 | 8 | 28 | 18 | 148.8 | 50 | 10 | 0.2 | 1 |
| 4 | 5 | Marvin Graves | Syracuse | Big East | 11 | 131 | 221 | 59.3 | 1912 | 8.7 | 7.3 | 10 | 11 | 136.9 | 99 | -148 | -1.5 | 1 |
Upvotes: 3