jbenfleming

Reputation: 75

How to skip over certain rows in table when web scraping

I'm scraping from this link: https://www.pro-football-reference.com/boxscores/201809060phi.htm

My code is as follows:

import requests
from bs4 import BeautifulSoup

# assign url
url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'

#parse and format url
r = requests.get(url).text
res = r.replace("<!--","").replace("-->","")
soup = BeautifulSoup(res, 'lxml')


#get tables
tables = soup.findAll("div",{"class":"table_outer_container"})

#get offense_stats table
offense_table = tables[5]
rows = offense_table.tbody.findAll("tr")

# here I want to iterate through the player rows and pull their stats;
# for now test_row stands in for a single player row
test_row = rows[0]

player = test_row.find("th",{"data-stat":"player"}).text
carries = test_row.find("td",{"data-stat":"rush_att"}).text
rush_yds = test_row.find("td",{"data-stat":"rush_yds"}).text
rush_tds = test_row.find("td",{"data-stat":"rush_td"}).text
targets = test_row.find("td",{"data-stat":"targets"}).text
recs = test_row.find("td",{"data-stat":"rec"}).text
rec_yds= test_row.find("td",{"data-stat":"rec_yds"}).text
rec_tds= test_row.find("td",{"data-stat":"rec_td"}).text

The table I need from the page (offensive stats) holds the stats for every player in the game, and I want to iterate through its rows, pulling the stats for each player. The problem is that two rows in the middle are headers, not player stats. My "rows" variable holds every "tr" element in the "tbody" of my "offense_table" variable, including the two header rows that I do not want. In this particular case they are rows[8] and rows[9], but that could differ from game to game.

#this is how the data rows begin (the ones I want)
<tr data-row="0">

#and this is how the header rows begin (the ones I want to skip over)
<tr class="over_header thead" data-row="8">

Does anybody know a way to skip these rows when iterating?

Upvotes: 0

Views: 1735

Answers (2)

Danil

Reputation: 5171

To select only the tr elements that have no class attribute, replace

rows = offense_table.tbody.findAll("tr")

with

rows = offense_table.findAll("tr", attrs={'class': None})
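
As a rough check of how a None value in attrs behaves, here is a minimal sketch run against a small made-up table rather than the real page. The HTML snippet, the player names, and the "thead" class on the second header row are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the offense table; this is not the real page markup.
html = """
<table>
  <tbody>
    <tr data-row="0"><th data-stat="player">Player A</th><td data-stat="rush_att">5</td></tr>
    <tr class="over_header thead" data-row="1"><th>Rushing</th></tr>
    <tr class="thead" data-row="2"><th data-stat="player">Player</th></tr>
    <tr data-row="3"><th data-stat="player">Player B</th><td data-stat="rush_att">12</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# A value of None in attrs matches only tags that lack that attribute,
# so any tr carrying a class (the header rows here) is left out.
rows = soup.tbody.find_all("tr", attrs={"class": None})

players = [row.find("th", {"data-stat": "player"}).text for row in rows]
print(players)
```

This only works as long as the data rows never carry a class of their own; if the site adds one later, the filter would drop those rows too.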

Upvotes: 1

Ruzihm

Reputation: 20249

If the rows you want to skip always have the over_header class, and the rows you want to keep never do, you can filter the results of findAll("tr") for rows that don't have the over_header class:

rows = offense_table.tbody.findAll("tr")
# check each row's own class list (row.find() would only search inside the row)
rows = filter(lambda row: 'over_header' not in row.get('class', []), rows)
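
Putting the filter together with the stat extraction from the question, a sketch might look like the following. It runs on a small made-up table, not the real page, and it assumes the second header row is marked with a "thead" class; the SKIP set, the is_data_row helper, and the sample values are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup shaped like the rows quoted in the question.
html = """
<table><tbody>
<tr data-row="0"><th data-stat="player">Player A</th><td data-stat="rush_att">5</td><td data-stat="rush_yds">20</td></tr>
<tr class="over_header thead" data-row="1"><th>Rushing</th></tr>
<tr class="thead" data-row="2"><th data-stat="player">Player</th></tr>
<tr data-row="3"><th data-stat="player">Player B</th><td data-stat="rush_att">12</td><td data-stat="rush_yds">61</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Classes assumed to mark header rows; adjust to what the real page uses.
SKIP = {"over_header", "thead"}

def is_data_row(row):
    # BeautifulSoup returns the class attribute as a list of class names;
    # rows without a class fall back to the empty list.
    return SKIP.isdisjoint(row.get("class", []))

stats = []
for row in filter(is_data_row, soup.tbody.find_all("tr")):
    stats.append({
        "player": row.find("th", {"data-stat": "player"}).text,
        "carries": int(row.find("td", {"data-stat": "rush_att"}).text),
        "rush_yds": int(row.find("td", {"data-stat": "rush_yds"}).text),
    })

print(stats)
```

Because the check looks at class names rather than row positions, it should not matter whether the header rows land at index 8 and 9 or somewhere else in a different game.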

Upvotes: 1
