eucalysis

Reputation: 23

Beautiful Soup: trying to get children of a div

I'm trying to get the team names and scores of Overwatch League matches from: https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1

What I need to do is scrape a series of child elements from the children of a larger div.

So far I have:

import requests
from bs4 import BeautifulSoup

url = 'https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1'
bs = BeautifulSoup(requests.get(url).content, 'html.parser')
match_schedule = []

matches = bs.find_all('div', {'class': 'schedule-boardstyles__ContainerCards-j4x5cc-8 jcvNlt'})

for match in matches:
    rows = match.find_all('div', {'class': 'schedule-boardstyles__ContainerMatchCard-j4x5cc-9 esCuul match-cardstyles__Container-sc-1rgscfz-0 doBeIs'})
    print("here")  # debug: never prints
    for row in rows:
        print('here2')  # debug: never prints
        team1 = row.find('p', {'class': 'match-cardstyles__MiddleText-sc-1rgscfz-12 hueupq'})
        score1 = row.find('p', {'class': 'match-cardstyles__ScoreText-sc-1rgscfz-23 gOtrSB'})
        score2 = row.find('p', {'class': 'match-cardstyles__ScoreText-sc-1rgscfz-23 jRejaZ'})
        team2 = row.find('p', {'class': 'match-cardstyles__MiddleText-sc-1rgscfz-12 cLYgmY'})
        temp = 'team_1:{}, score:"{}-{}", team_2:{}'.format(team1.text, score1.text, score2.text, team2.text)
        print(temp)
        match_schedule.append(temp)

But it's returning nothing, even from the initial matches scrape. Is there something I'm doing wrong?

Upvotes: 1

Views: 70

Answers (1)

Martin Evans

Reputation: 46759

The information is generated dynamically, so a browser would normally be needed to construct the page. It can, however, be extracted in two steps using the site's API: first request the main page to determine the required schedule ID, then use that ID to request the associated matches. The information is returned in JSON format.
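
You can confirm this yourself: the hashed class names from your selectors never appear in the raw HTML that requests receives. A quick sketch (not part of the final solution):

import requests

url = "https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1"
html = requests.get(url).text

# Expect False if the match cards are rendered client-side,
# which is consistent with find_all() returning nothing.
print('schedule-boardstyles__ContainerCards' in html)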

For example:

import requests
from bs4 import BeautifulSoup
import json

url = "https://overwatchleague.com/en-us/schedule?stage=regular_season&week=1"
session = requests.Session()

# Step 1: fetch the main page and read the Next.js data blob,
# which contains the schedule ID needed for the API request
r_main = session.get(url)
soup = BeautifulSoup(r_main.content, "html.parser")
js = soup.find('script', id="__NEXT_DATA__")
data_main = json.loads(js.string)
schedule = data_main['props']['pageProps']['blocks'][2]['schedule']['uid']

# Headers observed in the browser's network requests while the page loads
headers = {
    "Referer" : "https://overwatchleague.com/",
    "x-origin" : "overwatchleague.com",
    "Origin" : "https://overwatchleague.com",
    "DNT": "1",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "cross-site",
}

# Step 2: request the week's matches from the API as JSON
r_schedule = session.get(f'https://wzavfvwgfk.execute-api.us-east-2.amazonaws.com/production/v2/content-types/schedule/{schedule}/week/1?locale=en-us', headers=headers)
data_schedule = r_schedule.json()

matches = []

# Each match holds two competitors and their corresponding scores
for match in data_schedule['data']['tableData']['events'][0]['matches']:
    competitors = [c['name'] for c in match['competitors']]
    scores = match['scores']
    row = (competitors[0], scores[0], competitors[1], scores[1])
    matches.append(row)

    print(f"{row[0]:25}  {row[1]:2}  {row[2]:25}  {row[3]}")

Giving you:

Houston Outlaws             3  Dallas Fuel                2
Los Angeles Gladiators      1  San Francisco Shock        3
Guangzhou Charge            0  Shanghai Dragons           3
Los Angeles Valiant         1  Chengdu Hunters            3
Philadelphia Fusion         3  Seoul Dynasty              1
Toronto Defiant             3  Vancouver Titans           1
Atlanta Reign               1  Florida Mayhem             3
Dallas Fuel                 3  Los Angeles Gladiators     1
Guangzhou Charge            0  Seoul Dynasty              3
Chengdu Hunters             3  Shanghai Dragons           0
Philadelphia Fusion         3  Los Angeles Valiant        0
Houston Outlaws             3  San Francisco Shock        2
Florida Mayhem              3  Vancouver Titans           1
Toronto Defiant             3  Atlanta Reign              2
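
If you want to keep the results rather than just print them, the matches list of tuples can be written straight to a file. A minimal sketch using the standard csv module (the file name here is my own choice):

import csv

# Assumes `matches` is the list of (team_1, score_1, team_2, score_2)
# tuples built in the loop above
with open('week1_results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['team_1', 'score_1', 'team_2', 'score_2'])
    writer.writerows(matches)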

I strongly recommend you print out the JSON, e.g. data_schedule, to get a better understanding of all of the information that is returned. The other details in the script were obtained by using the browser's developer tools to see which requests were made whilst the page loaded.
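
A minimal way to do that, pretty-printed so the nesting is readable:

import json

# Pretty-print the full API response to explore its structure
print(json.dumps(data_schedule, indent=2))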

Upvotes: 1
