Reputation: 5
From this Tag:
<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000">Sat 13 Aug 2011</div>
I want to extract the text "Sat 13 Aug 2011" using Beautiful Soup (bs4).
My current code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.premierleague.com/match/7468'
j = requests.get(url)
soup = BeautifulSoup(j.content, "lxml")
containedDateTag_string = soup.find_all('div', class_="matchDate renderMatchDateContainer")
print (containedDateTag_string)
When I run it, the printed output does not contain "Sat 13 Aug 2011"; the tag comes back empty:
[<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000"></div>]
Is there a way to get this string? I have also tried going further into the tag with ".next_sibling" and ".text", but both display "[]" rather than the desired string, which is why I went back to selecting just the 'div' to see whether I could at least get the text to display.
Upvotes: 0
Views: 584
Reputation: 740
Without Selenium, using requests and the site's own API, it would look something like this (you could grab plenty of other data about each game from the same response, but here is just the code for the date part):
import requests
from time import sleep

def scraper(match_id):
    headers = {
        "Origin": "https://www.premierleague.com",
        "Referer": "https://www.premierleague.com/match/%d" % match_id
    }
    api_endpoint = "https://footballapi.pulselive.com/football/broadcasting-schedule/fixtures/%d" % match_id
    r = requests.get(api_endpoint, headers=headers)
    if not r.status_code == 200:
        return None
    else:
        data = r.json()
        # this will return something like this:
        # {'broadcasters': [],
        #  'fixture': {'attendance': 25700,
        #              'clock': {'label': "90 +4'00", 'secs': 5640},
        #              'gameweek': {'gameweek': 1, 'id': 744},
        #              'ground': {'city': 'London', 'id': 16, 'name': 'Craven Cottage'},
        #              'id': 7468,
        #              'kickoff': {'completeness': 3,
        #                          'gmtOffset': 1.0,
        #                          'label': 'Sat 13 Aug 2011, 15:00 BST',
        #                          'millis': 1313244000000},
        #              'neutralGround': False,
        #              'outcome': 'D',
        #              'phase': 'F',
        #              'replay': False,
        #              'status': 'C',
        #              'teams': [{'score': 0,
        #                         'team': {'club': {'abbr': 'FUL',
        #                                           'id': 34,
        #                                           'name': 'Fulham'},
        #                                  'id': 34,
        #                                  'name': 'Fulham',
        #                                  'shortName': 'Fulham',
        #                                  'teamType': 'FIRST'}},
        #                        {'score': 0,
        #                         'team': {'club': {'abbr': 'AVL',
        #                                           'id': 2,
        #                                           'name': 'Aston Villa'},
        #                                  'id': 2,
        #                                  'name': 'Aston Villa',
        #                                  'shortName': 'Aston Villa',
        #                                  'teamType': 'FIRST'}}]}}
        return data

match_id = 7468
json_blob = scraper(match_id)
if json_blob is not None:
    date = json_blob['fixture']['kickoff']['label']
    print(date)
The request needs both of those headers (Origin and Referer) to get the data. If you have a bunch of match IDs, you can loop through them with this function, pausing between requests:
for match_id in range(7000,8000,1):
    json_blob = scraper(match_id)
    if json_blob is not None:
        date = json_blob['fixture']['kickoff']['label']
        print(date)
    sleep(1)
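If a machine-readable date is more useful than the preformatted label, the 'millis' and 'gmtOffset' fields in the sample response can be converted with the standard library. This is only a minimal sketch: it assumes 'millis' is milliseconds since the Unix epoch and 'gmtOffset' is the offset from UTC in hours, both inferred from the sample values rather than from any API documentation, and `kickoff_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def kickoff_datetime(kickoff):
    """Convert a 'kickoff' dict (as in the sample response) to an aware datetime.

    Assumes 'millis' is milliseconds since the Unix epoch and 'gmtOffset'
    is the offset from UTC in hours; both are inferred from the sample data.
    """
    local_tz = timezone(timedelta(hours=kickoff.get('gmtOffset', 0)))
    return datetime.fromtimestamp(kickoff['millis'] / 1000, tz=local_tz)


# Values copied from the sample response for match 7468.
sample_kickoff = {'gmtOffset': 1.0, 'millis': 1313244000000}
kickoff = kickoff_datetime(sample_kickoff)
print(kickoff.isoformat())
```

With the real response you would pass json_blob['fixture']['kickoff'] instead of the hard-coded sample.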
Upvotes: 0
Reputation: 6518
Scraping the content using .page_source with selenium/ChromeDriver is the way to go here, since the date text is generated by JavaScript:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.premierleague.com/match/7468"
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
Then you can do your .find the way you were doing:
>>> soup.find('div', {'class':"matchDate renderMatchDateContainer"}).text
'Sat 13 Aug 2011'
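A batteries-included solution with selenium itself (on Selenium 4.3 and later the find_element_by_* helpers were removed, so use driver.find_element(By.CSS_SELECTOR, ...) with By imported from selenium.webdriver.common.by):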
>>> driver.find_element_by_css_selector("div.matchDate.renderMatchDateContainer").text
'Sat 13 Aug 2011'
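Since the data-kickoff attribute is already present in the static HTML (it shows up in the question's printed output even though the text is missing), the date can probably also be derived without a browser. Here is a rough sketch using only the standard library, assuming the attribute holds milliseconds since the Unix epoch, which is consistent with the 13 Aug 2011 date. `KickoffParser` is a hypothetical name, and the result is formatted in UTC. With bs4, the same value should be readable as tag['data-kickoff'] on the div you already found.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class KickoffParser(HTMLParser):
    """Collect data-kickoff values from divs that carry the matchDate class."""

    def __init__(self):
        super().__init__()
        self.kickoffs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get('class') or '').split()
        if tag == 'div' and 'matchDate' in classes and 'data-kickoff' in attr_map:
            self.kickoffs.append(int(attr_map['data-kickoff']))


# The tag as it appears in the question's output (no text inside).
page_html = '<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000"></div>'
parser = KickoffParser()
parser.feed(page_html)

# Assumption: the value is milliseconds since the Unix epoch.
kickoff_utc = datetime.fromtimestamp(parser.kickoffs[0] / 1000, tz=timezone.utc)
match_date = kickoff_utc.strftime('%a %d %b %Y')
print(match_date)
```

Note that formatting in UTC can give a different calendar day than the site shows for matches that kick off close to midnight local time.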
Upvotes: 1