MJ95
MJ95

Reputation: 479

Python: BeautifulSoup combine different table headers from same table

new to Python so this might be a basic question but I have the following table:

https://www.sports-reference.com/cfb/years/1991-passing.html

And I'd like to scrape it with BeautifulSoup to have an output like this:

Player School Conf all the way to TD under Rushing
Ty Detmer Brigham Young WAC 7
Player Two School Two Conf 2 5

Problem 1: If you look at the URL above, every 21st row is a header row that should be ignored Problem 2: "Rushing" seems to be another th so my code and output below currently are like this:

import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup

data_universe = {}
years = list(range(1990,1991))
COLUMNS = ['Player', 'School', 'Conf', 'G', 'Cmp', 'Att', 'Pct', 'Yds', 'Y/A', 'AY/A', 'TD', 'Int', 'Rate', 'Rush_Att', 'Rush_Yds', 'Avg', 'Rush_TD']

for year in years:

  url = 'https://www.sports-reference.com/cfb/years/%s-passing.html' %year
  r = requests.get(url)
  soup = BeautifulSoup(r.content, 'html.parser')
  parsed_table = soup.find_all('table')[0]
  rows = parsed_table.find_all("tr") 

  cy_data = []
  for row in rows[2:]:
      cells = row.find_all("td")
      cells = cells[0:18] 
      cy_data.append([cell.text for cell in cells]) # For each "td" tag, get the text inside it
  cy_data = pd.DataFrame(cy_data, columns=COLUMNS)
  pd.set_option("display.max_rows", None, "display.max_columns", None)
  print(cy_data.head())

Output:

enter image description here

How can I format this table neatly in the dataframe like it is on the site?

Upvotes: 0

Views: 417

Answers (1)

RJ Adriaansen
RJ Adriaansen

Reputation: 9619

You can load a html table directly into pandas using read_html, no need to use BeautifulSoup. You can then process the dataframe by removing the top header row and the mid-table header rows:

df = pd.read_html('https://www.sports-reference.com/cfb/years/1991-passing.html')[0]
df.columns = df.columns.droplevel(0) # drop top header row
df = df[df['Rk'].ne('Rk')] # remove mid-table header rows 

Output:

|    |   Rk | Player         | School        | Conf     |   G |   Cmp |   Att |   Pct |   Yds |   Y/A |   AY/A |   TD |   Int |   Rate |   Att |   Yds |   Avg |   TD |
|---:|-----:|:---------------|:--------------|:---------|----:|------:|------:|------:|------:|------:|-------:|-----:|------:|-------:|------:|------:|------:|-----:|
|  0 |    1 | Ty Detmer      | Brigham Young | WAC      |  12 |   249 |   403 |  61.8 |  4031 |  10   |   10.4 |   35 |    12 |  168.5 |    75 |   -30 |  -0.4 |    4 |
|  1 |    2 | Rick Mirer     | Notre Dame    | Ind      |  12 |   132 |   234 |  56.4 |  2117 |   9   |    8.7 |   18 |    10 |  149.2 |    75 |   306 |   4.1 |    9 |
|  2 |    3 | J.J. Joe       | Baylor        | SWC      |  11 |   109 |   206 |  52.9 |  1853 |   9   |    7.9 |    7 |     8 |  131.9 |   116 |   147 |   1.3 |    6 |
|  3 |    4 | Shane Matthews | Florida       | SEC      |  11 |   218 |   361 |  60.4 |  3130 |   8.7 |    8   |   28 |    18 |  148.8 |    50 |    10 |   0.2 |    1 |
|  4 |    5 | Marvin Graves  | Syracuse      | Big East |  11 |   131 |   221 |  59.3 |  1912 |   8.7 |    7.3 |   10 |    11 |  136.9 |    99 |  -148 |  -1.5 |    1 |

Upvotes: 3

Related Questions