acegay27

Reputation: 7

Returning Text From Specific Class BeautifulSoup4

I am new to BS4 and am trying to scrape data for a specific HTML class. A snippet of my HTML looks like the following:

<td class="right">31</td>
<td class="right gamelink">
   <a href="/boxscores/20220908ram.htm">
      F
      <span class="no_mobile">inal</span>
   </a>
</td>

The problem I am having is that when I call findAll() for the class "right", I also get the contents of the element with class "right gamelink". Is there a way to return text only from elements whose class is exactly "right", and not from those with "right gamelink"?

Code:

from bs4 import BeautifulSoup
import requests


weekNumber = 1
url = "https://www.pro-football-reference.com/years/2022/week_"+str(weekNumber)+".htm"

print(url)

req = requests.get(url)
webpage = BeautifulSoup(req.text, 'html.parser')

scores = webpage.findAll("td", attrs={'class': 'right'})

for score in scores:
    current_score = score.text.strip()
    print(current_score)

Output:

31
Final

Upvotes: 0

Views: 62

Answers (3)

Andrei

Reputation: 21

The issue in your case:

<div>
     <td class="right">31</td>
     <td class="right gamelink"></td>
</div>

Both td elements have the class "right". The difference is that the second td also has a second class, "gamelink". Your code returns every td whose class list includes "right", which is the correct behavior for that query. To get only the elements whose class attribute is exactly "right", use a CSS attribute selector and replace

scores = webpage.findAll("td", attrs={'class': 'right'})

with this:

scores = webpage.select("td[class='right']")

This should return only the elements whose class is exactly "right".
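
To see the difference without hitting the site, here is a minimal sketch that runs both queries against an inline string. The markup is a hypothetical sample modeled on the snippet in the question, not the real page, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the snippet in the question.
html = """
<table><tr>
  <td class="right">31</td>
  <td class="right gamelink"><a href="/boxscores/20220908ram.htm">F<span class="no_mobile">inal</span></a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches any td whose class list contains "right", so both cells come back.
loose_texts = [td.get_text(strip=True) for td in soup.find_all("td", class_="right")]

# Matches only when the whole class attribute equals "right".
exact_texts = [td.get_text(strip=True) for td in soup.select('td[class="right"]')]

print(loose_texts)  # ['31', 'Final']
print(exact_texts)  # ['31']
```

The attribute selector compares the full attribute value, which is why the "right gamelink" cell drops out.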

Upvotes: 1

Sergey K

Reputation: 1687

Maybe this will be useful for you. It reads each game summary block and builds a table of teams, scores, and results:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.pro-football-reference.com/years/2022/week_1.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
results = []
for game in soup.find_all('div', class_='game_summary expanded nohover'):
    teams = []
    for x in game.find('table', class_='teams').find_all('tr'):
        teams.append(list(filter(None, [a.strip() for a in x.get_text().split('\n')])))
    results.append({
        'Date': teams[0][0],
        'Home': {
            'Name': teams[1][0],
            'Score': teams[1][1]
        },
        'Guest': {
            'Name': teams[2][0],
            'Score': teams[2][1]
        },
        'Result': teams[1][-1] if len(teams[2]) < 3 else f'{teams[1][-1]} {teams[2][-1]}'
    })
df = pd.DataFrame(results)
print(df.to_string(index=False))

OUTPUT:

        Date                                            Home                                            Guest   Result
 Sep 8, 2022        {'Name': 'Buffalo Bills', 'Score': '31'}      {'Name': 'Los Angeles Rams', 'Score': '10'}    Final
Sep 11, 2022   {'Name': 'New Orleans Saints', 'Score': '27'}       {'Name': 'Atlanta Falcons', 'Score': '26'}    Final
Sep 11, 2022     {'Name': 'Cleveland Browns', 'Score': '26'}     {'Name': 'Carolina Panthers', 'Score': '24'}    Final
Sep 11, 2022  {'Name': 'San Francisco 49ers', 'Score': '10'}         {'Name': 'Chicago Bears', 'Score': '19'}    Final
Sep 11, 2022  {'Name': 'Pittsburgh Steelers', 'Score': '23'}    {'Name': 'Cincinnati Bengals', 'Score': '20'} Final OT
Sep 11, 2022  {'Name': 'Philadelphia Eagles', 'Score': '38'}         {'Name': 'Detroit Lions', 'Score': '35'}    Final
Sep 11, 2022   {'Name': 'Indianapolis Colts', 'Score': '20'}        {'Name': 'Houston Texans', 'Score': '20'} Final OT
Sep 11, 2022  {'Name': 'New England Patriots', 'Score': '7'}        {'Name': 'Miami Dolphins', 'Score': '20'}    Final
Sep 11, 2022     {'Name': 'Baltimore Ravens', 'Score': '24'}          {'Name': 'New York Jets', 'Score': '9'}    Final
Sep 11, 2022 {'Name': 'Jacksonville Jaguars', 'Score': '22'} {'Name': 'Washington Commanders', 'Score': '28'}    Final
Sep 11, 2022   {'Name': 'Kansas City Chiefs', 'Score': '44'}     {'Name': 'Arizona Cardinals', 'Score': '21'}    Final
Sep 11, 2022     {'Name': 'Green Bay Packers', 'Score': '7'}     {'Name': 'Minnesota Vikings', 'Score': '23'}    Final
Sep 11, 2022      {'Name': 'New York Giants', 'Score': '21'}      {'Name': 'Tennessee Titans', 'Score': '20'}    Final
Sep 11, 2022    {'Name': 'Las Vegas Raiders', 'Score': '19'}  {'Name': 'Los Angeles Chargers', 'Score': '24'}    Final
Sep 11, 2022 {'Name': 'Tampa Bay Buccaneers', 'Score': '19'}         {'Name': 'Dallas Cowboys', 'Score': '3'}    Final
Sep 12, 2022       {'Name': 'Denver Broncos', 'Score': '16'}      {'Name': 'Seattle Seahawks', 'Score': '17'}    Final

Or you can change the dict to:

results.append({
        'Date': teams[0][0],
        'Home Team': teams[1][0],
        'Guest Team': teams[2][0],
        'Score': f'{teams[1][1]}-{teams[2][1]}',
        'Result': teams[1][-1] if len(teams[2]) < 3 else f'{teams[1][-1]} {teams[2][-1]}'
    })

And your table now looks like:

        Date            Home Team            Guest Team Score   Result
 Sep 8, 2022        Buffalo Bills      Los Angeles Rams 31-10    Final
Sep 11, 2022   New Orleans Saints       Atlanta Falcons 27-26    Final
Sep 11, 2022     Cleveland Browns     Carolina Panthers 26-24    Final
Sep 11, 2022  San Francisco 49ers         Chicago Bears 10-19    Final
Sep 11, 2022  Pittsburgh Steelers    Cincinnati Bengals 23-20 Final OT
Sep 11, 2022  Philadelphia Eagles         Detroit Lions 38-35    Final
Sep 11, 2022   Indianapolis Colts        Houston Texans 20-20 Final OT
Sep 11, 2022 New England Patriots        Miami Dolphins  7-20    Final
Sep 11, 2022     Baltimore Ravens         New York Jets  24-9    Final
Sep 11, 2022 Jacksonville Jaguars Washington Commanders 22-28    Final
Sep 11, 2022   Kansas City Chiefs     Arizona Cardinals 44-21    Final
Sep 11, 2022    Green Bay Packers     Minnesota Vikings  7-23    Final
Sep 11, 2022      New York Giants      Tennessee Titans 21-20    Final
Sep 11, 2022    Las Vegas Raiders  Los Angeles Chargers 19-24    Final
Sep 11, 2022 Tampa Bay Buccaneers        Dallas Cowboys  19-3    Final
Sep 12, 2022       Denver Broncos      Seattle Seahawks 16-17    Final

Upvotes: 0

Jack Fleeting

Reputation: 24940

Use CSS selectors instead, in the format below. Change

scores = webpage.findAll("td", attrs={'class': 'right'})

to

scores = webpage.select('td[class="right"]')

and see if it works.
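
If the exact attribute match turns out to be too strict or too loose for your page, two other approaches may be worth trying. This is a sketch on hypothetical inline markup modeled on the question, not the live site: one excludes the unwanted class with :not(), the other compares the parsed class list in a filter function.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the snippet in the question.
html = """
<table><tr>
  <td class="right">31</td>
  <td class="right gamelink"><a href="/boxscores/20220908ram.htm">F<span class="no_mobile">inal</span></a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: keep td.right cells but explicitly exclude the "gamelink" class.
without_gamelink = [
    td.get_text(strip=True) for td in soup.select("td.right:not(.gamelink)")
]

# Option 2: pass a function to find_all and compare the parsed class list,
# which Beautiful Soup exposes as a list of strings for the class attribute.
only_right = [
    td.get_text(strip=True)
    for td in soup.find_all(
        lambda tag: tag.name == "td" and tag.get("class") == ["right"]
    )
]

print(without_gamelink)  # ['31']
print(only_right)        # ['31']
```

Option 1 still matches cells that carry other extra classes besides "gamelink", while Option 2 accepts only cells whose sole class is "right", so pick whichever fits the rest of the table.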

Upvotes: 0
