Reputation: 7
I am new to BS4 and am trying to scrape data for a specific HTML class. A snippet of my HTML looks like the following:
<td class="right">31</td>
<td class="right gamelink">
  <a href="/boxscores/20220908ram.htm">
    F
    <span class="no_mobile">inal</span>
  </a>
</td>
The problem is that when I call findAll() for the class "right", I also get the contents of the elements with class "right gamelink". Is there a way to specify that the returned text should only come from elements whose class is "right", and not from those whose class is "right gamelink"?
Code:
from bs4 import BeautifulSoup
import requests
weekNumber = 1
url = "https://www.pro-football-reference.com/years/2022/week_"+str(weekNumber)+".htm"
print(url)
req = requests.get(url)
webpage = BeautifulSoup(req.text, 'html.parser')
scores = webpage.findAll("td", attrs={'class': 'right'})
for score in scores:
    current_score = score.text.strip()
    print(current_score)
Output:
31
Final
Upvotes: 0
Views: 62
Reputation: 21
The issue in your case:
<div>
<td class="right">31</td>
<td class="right gamelink"></td>
</div>
Both td elements have the class "right". The difference is that the second td also has a second class, "gamelink". Your code returns every td element that has the "right" class, which is the expected behavior when matching on a single class name. To get only the elements whose class attribute is exactly "right", with no other classes, use a CSS attribute selector and replace
scores = webpage.findAll("td", attrs={'class': 'right'})
with this:
scores = webpage.select("td[class='right']")
You should then get only the elements whose class is exactly "right".
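To make the difference concrete, here is a minimal sketch run against an inline copy of the snippet from the question rather than the live page. The `<table><tr>` wrapper and the variable names are my own assumptions for illustration. It compares the class-name match with the exact attribute selector, and also shows a plain-Python filter on the parsed class list as a third option.

```python
from bs4 import BeautifulSoup

# Inline copy of the markup from the question (the table/tr wrapper is assumed).
html = (
    '<table><tr>'
    '<td class="right">31</td>'
    '<td class="right gamelink">'
    '<a href="/boxscores/20220908ram.htm">F<span class="no_mobile">inal</span></a>'
    '</td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")

# Matching on one class name returns every td that carries that class,
# including the td whose class attribute is "right gamelink".
any_right = [td.get_text(strip=True) for td in soup.find_all("td", class_="right")]

# The attribute selector compares the whole class attribute value,
# so only the td with class="right" matches.
only_right = [td.get_text(strip=True) for td in soup.select('td[class="right"]')]

# Same idea without CSS: bs4 exposes class as a list of names.
only_right_alt = [
    td.get_text(strip=True)
    for td in soup.find_all("td")
    if td.get("class") == ["right"]
]

print(any_right)       # ['31', 'Final']
print(only_right)      # ['31']
print(only_right_alt)  # ['31']
```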
Upvotes: 1
Reputation: 1687
Maybe this will be useful for you:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.pro-football-reference.com/years/2022/week_1.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
results = []
# Each game on the page sits in its own summary div.
for game in soup.find_all('div', class_='game_summary expanded nohover'):
    teams = []
    # Split every row of the teams table into its non-empty text cells.
    for x in game.find('table', class_='teams').find_all('tr'):
        teams.append(list(filter(None, [a.strip() for a in x.get_text().split('\n')])))
    results.append({
        'Date': teams[0][0],
        'Home': {
            'Name': teams[1][0],
            'Score': teams[1][1]
        },
        'Guest': {
            'Name': teams[2][0],
            'Score': teams[2][1]
        },
        # If the second team's row has a third cell (e.g. 'OT'), append it to the result.
        'Result': teams[1][-1] if len(teams[2]) < 3 else f'{teams[1][-1]} {teams[2][-1]}'
    })
df = pd.DataFrame(results)
print(df.to_string(index=False))
Output:
Date Home Guest Result
Sep 8, 2022 {'Name': 'Buffalo Bills', 'Score': '31'} {'Name': 'Los Angeles Rams', 'Score': '10'} Final
Sep 11, 2022 {'Name': 'New Orleans Saints', 'Score': '27'} {'Name': 'Atlanta Falcons', 'Score': '26'} Final
Sep 11, 2022 {'Name': 'Cleveland Browns', 'Score': '26'} {'Name': 'Carolina Panthers', 'Score': '24'} Final
Sep 11, 2022 {'Name': 'San Francisco 49ers', 'Score': '10'} {'Name': 'Chicago Bears', 'Score': '19'} Final
Sep 11, 2022 {'Name': 'Pittsburgh Steelers', 'Score': '23'} {'Name': 'Cincinnati Bengals', 'Score': '20'} Final OT
Sep 11, 2022 {'Name': 'Philadelphia Eagles', 'Score': '38'} {'Name': 'Detroit Lions', 'Score': '35'} Final
Sep 11, 2022 {'Name': 'Indianapolis Colts', 'Score': '20'} {'Name': 'Houston Texans', 'Score': '20'} Final OT
Sep 11, 2022 {'Name': 'New England Patriots', 'Score': '7'} {'Name': 'Miami Dolphins', 'Score': '20'} Final
Sep 11, 2022 {'Name': 'Baltimore Ravens', 'Score': '24'} {'Name': 'New York Jets', 'Score': '9'} Final
Sep 11, 2022 {'Name': 'Jacksonville Jaguars', 'Score': '22'} {'Name': 'Washington Commanders', 'Score': '28'} Final
Sep 11, 2022 {'Name': 'Kansas City Chiefs', 'Score': '44'} {'Name': 'Arizona Cardinals', 'Score': '21'} Final
Sep 11, 2022 {'Name': 'Green Bay Packers', 'Score': '7'} {'Name': 'Minnesota Vikings', 'Score': '23'} Final
Sep 11, 2022 {'Name': 'New York Giants', 'Score': '21'} {'Name': 'Tennessee Titans', 'Score': '20'} Final
Sep 11, 2022 {'Name': 'Las Vegas Raiders', 'Score': '19'} {'Name': 'Los Angeles Chargers', 'Score': '24'} Final
Sep 11, 2022 {'Name': 'Tampa Bay Buccaneers', 'Score': '19'} {'Name': 'Dallas Cowboys', 'Score': '3'} Final
Sep 12, 2022 {'Name': 'Denver Broncos', 'Score': '16'} {'Name': 'Seattle Seahawks', 'Score': '17'} Final
Or you can change the dict to:
results.append({
    'Date': teams[0][0],
    'Home Team': teams[1][0],
    'Guest Team': teams[2][0],
    'Score': f'{teams[1][1]}-{teams[2][1]}',
    'Result': teams[1][-1] if len(teams[2]) < 3 else f'{teams[1][-1]} {teams[2][-1]}'
})
And your table now looks like:
Date Home Team Guest Team Score Result
Sep 8, 2022 Buffalo Bills Los Angeles Rams 31-10 Final
Sep 11, 2022 New Orleans Saints Atlanta Falcons 27-26 Final
Sep 11, 2022 Cleveland Browns Carolina Panthers 26-24 Final
Sep 11, 2022 San Francisco 49ers Chicago Bears 10-19 Final
Sep 11, 2022 Pittsburgh Steelers Cincinnati Bengals 23-20 Final OT
Sep 11, 2022 Philadelphia Eagles Detroit Lions 38-35 Final
Sep 11, 2022 Indianapolis Colts Houston Texans 20-20 Final OT
Sep 11, 2022 New England Patriots Miami Dolphins 7-20 Final
Sep 11, 2022 Baltimore Ravens New York Jets 24-9 Final
Sep 11, 2022 Jacksonville Jaguars Washington Commanders 22-28 Final
Sep 11, 2022 Kansas City Chiefs Arizona Cardinals 44-21 Final
Sep 11, 2022 Green Bay Packers Minnesota Vikings 7-23 Final
Sep 11, 2022 New York Giants Tennessee Titans 21-20 Final
Sep 11, 2022 Las Vegas Raiders Los Angeles Chargers 19-24 Final
Sep 11, 2022 Tampa Bay Buccaneers Dallas Cowboys 19-3 Final
Sep 12, 2022 Denver Broncos Seattle Seahawks 16-17 Final
Upvotes: 0
Reputation: 24940
Use CSS selectors instead, in the format below. Change
scores = webpage.findAll("td", attrs={'class': 'right'})
to
scores = webpage.select('td[class="right"]')
and see if it works.
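If the goal is specifically to skip the "gamelink" cells rather than to require an exact class value, a `:not()` selector is another option. This is only a sketch against an inline copy of the question's markup (the `<table><tr>` wrapper and variable names are assumptions). Unlike the exact match above, it would still return a td that has "right" plus some other class that is not "gamelink".

```python
from bs4 import BeautifulSoup

# Inline copy of the markup from the question (the table/tr wrapper is assumed).
html = (
    '<table><tr>'
    '<td class="right">31</td>'
    '<td class="right gamelink">'
    '<a href="/boxscores/20220908ram.htm">F<span class="no_mobile">inal</span></a>'
    '</td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")

# Select td elements that have the class "right" but do not have "gamelink".
scores = [td.get_text(strip=True) for td in soup.select("td.right:not(.gamelink)")]

print(scores)  # ['31']
```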
Upvotes: 0