Jack L

Reputation: 5

Beautiful Soup BS4 "data-foo" associated text between tags not displaying

From this Tag:

<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000">Sat 13 Aug 2011</div>

I want to extract the "Sat 13 Aug 2011" using bs4 Beautiful Soup.

My current Code:

import requests
from bs4 import BeautifulSoup
url = 'https://www.premierleague.com/match/7468'
j = requests.get(url)
soup = BeautifulSoup(j.content, "lxml")

containedDateTag_string = soup.find_all('div', class_="matchDate renderMatchDateContainer")
print (containedDateTag_string)

When I run it, the printed output does not contain "Sat 13 Aug 2011" and is simply stored and printed as:

[<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000"></div>]

Is there a way that I can get this string to be displayed? I have also tried parsing further through the tag with ".next_sibling" and ".text", but both display "[]" rather than the desired string, which is why I reverted to searching for just the 'div' to see if I could at least get the text to display.
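Incidentally, the "data-kickoff" attribute that does appear in the static HTML is a Unix timestamp in milliseconds, so the same date can be recovered from it with just the standard library. A small sketch, assuming the timestamp is a UTC-based epoch value:

```python
from datetime import datetime, timezone

# "data-kickoff" holds milliseconds since the Unix epoch
kickoff_ms = 1313244000000
kickoff = datetime.fromtimestamp(kickoff_ms / 1000, tz=timezone.utc)

print(kickoff.strftime('%a %d %b %Y'))  # Sat 13 Aug 2011
```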

Upvotes: 0

Views: 584

Answers (2)

jlaur

Reputation: 740

Without Selenium - but using requests and the site's own API - it would look something like this (you could grab plenty of other data about each game, but here is just the code for the date part):

import requests
from time import sleep

def scraper(match_id):
    headers = {
        "Origin": "https://www.premierleague.com",
        "Referer": "https://www.premierleague.com/match/%d" % match_id,
    }

    api_endpoint = "https://footballapi.pulselive.com/football/broadcasting-schedule/fixtures/%d" % match_id
    r = requests.get(api_endpoint, headers=headers)
    if r.status_code != 200:
        return None
    else:
        data = r.json()
        # this will return something like this:
        # {'broadcasters': [],
        #  'fixture': {'attendance': 25700,
        #              'clock': {'label': "90 +4'00", 'secs': 5640},
        #              'gameweek': {'gameweek': 1, 'id': 744},
        #              'ground': {'city': 'London', 'id': 16, 'name': 'Craven Cottage'},
        #              'id': 7468,
        #              'kickoff': {'completeness': 3,
        #                          'gmtOffset': 1.0,
        #                          'label': 'Sat 13 Aug 2011, 15:00 BST',
        #                          'millis': 1313244000000},
        #              'neutralGround': False,
        #              'outcome': 'D',
        #              'phase': 'F',
        #              'replay': False,
        #              'status': 'C',
        #              'teams': [{'score': 0,
        #                         'team': {'club': {'abbr': 'FUL',
        #                                           'id': 34,
        #                                           'name': 'Fulham'},
        #                                  'id': 34,
        #                                  'name': 'Fulham',
        #                                  'shortName': 'Fulham',
        #                                  'teamType': 'FIRST'}},
        #                        {'score': 0,
        #                         'team': {'club': {'abbr': 'AVL',
        #                                           'id': 2,
        #                                           'name': 'Aston Villa'},
        #                                  'id': 2,
        #                                  'name': 'Aston Villa',
        #                                  'shortName': 'Aston Villa',
        #                                  'teamType': 'FIRST'}}]}}

        return data

match_id = 7468
json_blob = scraper(match_id)
if json_blob is not None:
    date = json_blob['fixture']['kickoff']['label']
    print(date)

You need the headers with those two fields to get the data. So if you had a bunch of match_ids you could just loop through them with this function:

for match_id in range(7000, 8000):
    json_blob = scraper(match_id)
    if json_blob is not None:
        date = json_blob['fixture']['kickoff']['label']
        print(date)
        sleep(1)
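If you only want the date as it is shown on the match page, note that the API's kickoff label also includes the time and timezone; splitting off everything after the comma is enough. A minimal sketch with the label hard-coded for illustration:

```python
# Example label as returned in json_blob['fixture']['kickoff']['label']
label = 'Sat 13 Aug 2011, 15:00 BST'

# Everything before the comma is the date part shown on the page
match_date = label.split(',')[0]

print(match_date)  # Sat 13 Aug 2011
```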

Upvotes: 0

Vinícius Figueiredo

Reputation: 6518

Scraping the content via .page_source with selenium/ChromeDriver is the way to go here, since the date text is generated by JavaScript:

from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.premierleague.com/match/7468"
driver = webdriver.Chrome()
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'lxml')

Then you can do your .find the way you were doing:

>>> soup.find('div', {'class':"matchDate renderMatchDateContainer"}).text

'Sat 13 Aug 2011'
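If you prefer CSS selectors on the BeautifulSoup side as well, .select_one accepts the same selector string as Selenium's CSS lookup. A minimal sketch, using the div from the question as a stand-in for the rendered page source:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after the JavaScript has rendered the date
html = ('<div class="matchDate renderMatchDateContainer" '
        'data-kickoff="1313244000000">Sat 13 Aug 2011</div>')
soup = BeautifulSoup(html, 'html.parser')

# .select_one takes a CSS selector, so chained classes work directly
print(soup.select_one('div.matchDate.renderMatchDateContainer').text)
# Sat 13 Aug 2011
```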

A batteries included solution with selenium itself:

>>> driver.find_element_by_css_selector("div.matchDate.renderMatchDateContainer").text
'Sat 13 Aug 2011'

Upvotes: 1
