wazo

Reputation: 311

Get values using BS4

I'm trying to get the "data-val" values from my soup, but they all come back in one huge list rather than split into separate lists/columns as shown on the website.

I know the headers are here:

<th class="num record drop-3" data-tsorter="data-val">
    <span class="long-points">
     proj. pts.
    </span>
    <span class="short-points">
     pts.
    </span>
   </th>
   <th class="pct" data-tsorter="data-val">
    <span class="full-relegated">
     relegated
    </span>
    <span class="small-relegated">
     rel.
    </span>
   </th>
   <th class="pct" data-tsorter="data-val">
    <span class="full-champ">
     qualify for UCL
    </span>
    <span class="small-champ">
     make UCL
    </span>
   </th>
   <th class="pct sorted" data-tsorter="data-val">
    <span class="drop-1">
     win Premier League
    </span>
    <span class="small-league">
     win league
    </span>
   </th>

This is what I'm trying:

import requests
from bs4 import BeautifulSoup

url = 'https://projects.fivethirtyeight.com/soccer-predictions/premier-league/'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
table = soup.find("table", {"class": "forecast-table"})
# print(table.prettify())
for i in table.find_all("td", {"class": "pct"}):
    print(i)

So ideally I'd like 4 lists, each labelled with its class name and holding the matching values.

Upvotes: 1

Views: 192

Answers (1)

Danielle M.

Reputation: 3662

Not entirely sure which specific columns you want, but this gets every cell that has a data-val in the tag's attributes:

import requests
from bs4 import BeautifulSoup

url = 'https://projects.fivethirtyeight.com/soccer-predictions/premier-league/'
r = requests.get(url)

soup = BeautifulSoup(r.text, "html.parser")
table = soup.find("table", {"class": "forecast-table"})

team_rows = table.find_all("tr", {"class": "team-row"})

for team in team_rows:
    # The team name is stored in the row's data-str attribute
    print("Team name: {}".format(team['data-str']))

    team_data = team.find_all("td")

    for data in team_data:
        # Only print cells that actually carry a data-val attribute
        if data.has_attr('data-val'):
            print("\t{}".format(data['data-val']))
    print("\n")

If I understand your question correctly, you're looking for the last couple of values in each row, which are fairly untagged in the HTML source. When that's the case, you can try simply indexing the cells by position (something like team_data[6]), although that is of course not very robust. But this is HTML parsing, so "not very robust" is par for the course, imho.

What I'm doing here is finding all the team rows (which is easy thanks to the class name), and then simply looping through all the td tags inside each team row's tr.
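
If you want one list per column rather than a flat printout, something along these lines might work. It is only a sketch: SAMPLE_HTML is made-up markup loosely modelled on the snippet in the question, columns_by_header is a hypothetical helper name, and it assumes each team row has its data-val cells in the same order as the th headers that carry data-tsorter="data-val". Check that against the real page before relying on it.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup; the real page has more columns
# and its structure may differ.
SAMPLE_HTML = """
<table class="forecast-table">
  <thead>
    <tr>
      <th class="team">team</th>
      <th class="pct" data-tsorter="data-val">
        <span class="full-relegated">relegated</span>
        <span class="small-relegated">rel.</span>
      </th>
      <th class="pct sorted" data-tsorter="data-val">
        <span class="drop-1">win Premier League</span>
        <span class="small-league">win league</span>
      </th>
    </tr>
  </thead>
  <tbody>
    <tr class="team-row" data-str="Team A">
      <td class="team">Team A</td>
      <td class="pct" data-val="0.01">1%</td>
      <td class="pct" data-val="0.9">90%</td>
    </tr>
    <tr class="team-row" data-str="Team B">
      <td class="team">Team B</td>
      <td class="pct" data-val="0.30">30%</td>
      <td class="pct" data-val="0.02">2%</td>
    </tr>
  </tbody>
</table>
"""


def columns_by_header(html):
    """Return (team names, {header text: [data-val per team]})."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "forecast-table"})

    # Use the first piece of text in each sortable header as the column name.
    names = [
        next(th.stripped_strings, "")
        for th in table.find_all("th")
        if th.get("data-tsorter") == "data-val"
    ]
    columns = {name: [] for name in names}
    teams = []

    for row in table.find_all("tr", {"class": "team-row"}):
        teams.append(row["data-str"])
        cells = [td for td in row.find_all("td") if td.has_attr("data-val")]
        # Assumption: cells line up one-to-one with the headers, in order.
        for name, cell in zip(names, cells):
            columns[name].append(cell["data-val"])

    return teams, columns


teams, columns = columns_by_header(SAMPLE_HTML)
for name, values in columns.items():
    print(name, values)
```

The values come back as strings, so convert them with float() if you need numbers. To run this against the live page, pass r.text instead of SAMPLE_HTML.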

Upvotes: 2
