Reputation: 305
I am attempting to scrape sports schedule data from the link below
https://sport-tv-guide.live/live/darts
I am using the following code below
import requests
from bs4 import BeautifulSoup

def makesoup(url):
    page = requests.get(url)
    return BeautifulSoup(page.text, "lxml")

def matchscrape(g_data):
    for match in g_data:
        datetimes = match.find('div', class_='main time col-sm-2 hidden-xs').text.strip()
        print("DateTimes; ", datetimes)
        print('-' * 80)

def matches():
    soup = makesoup(url="https://sport-tv-guide.live/live/darts")
    matchscrape(g_data=soup.findAll("div", {"class": "listData"}))
The issue is that only the first result is returned (see below),
whereas two values should be output (see below).
I printed the output received from running
def matches():
    soup = makesoup(url="https://sport-tv-guide.live/live/darts")
    matchscrape(g_data=soup.findAll("div", {"class": "listData"}))
and it appears that only the first result is present in the HTML that comes back (see below). That explains why only the first result is printed: it is the only one in the HTML received. What I am unsure of is why the HTML that BeautifulSoup receives is incomplete, so that not all the results can be output.
Thanks to anyone who can assist or solve this issue.
Upvotes: 2
Views: 574
Reputation: 305
Thanks to the helpful answers above, I identified the issue: the site stores a cookie that records which countries the user has selected, and it only shows schedule data for those countries. In this example there was a listing at 18:00 for a channel in Australia. It did not appear in the output of my code above because the request sent by the requests module carried no cookie data.
I was able to provide the necessary cookie information via the code below
def makesoup(url):
    cookies = {'mycountries': '101,28,3,102,42,10,18,4,2'}  # pass cookie data
    r = requests.post(url, cookies=cookies)
    return BeautifulSoup(r.text, "html.parser")
and the correct information was now outputted
Just posting this answer in case it helps someone with a similar problem in the future.
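As a rough sketch of the same idea, the cookie can also be set once on a requests.Session so that every request made through the session carries it. The snippet below only prepares a request and inspects its Cookie header; it does not contact the site. It uses GET for the illustration (the code above uses POST), and the variable names are my own.

```python
import requests

URL = "https://sport-tv-guide.live/live/darts"

# Store the country selection on the session; every request made
# through this session will then carry the cookie.
session = requests.Session()
session.cookies.set("mycountries", "101,28,3,102,42,10,18,4,2")

# Build the request without sending it, to inspect what would go out.
prepared = session.prepare_request(requests.Request("GET", URL))
cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

Calling session.get(URL) or session.post(URL) afterwards sends the same cookie without passing cookies= each time.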
Upvotes: 1
Reputation: 555
As @Ycmelon already answered, there is only one timestamp for me too. Still, something else might be causing the problem. Websites often have dynamic content, as this one does, and plain requests does not run the JavaScript that loads it, so that content can be missing from the response.
If you are sure the problem is that requests doesn't fetch the full page, try requests_html
(pip install requests-html), which can render JavaScript-driven content:
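from requests_html import HTMLSession
from bs4 import BeautifulSoup

LINK = "https://sport-tv-guide.live/live/darts"

session = HTMLSession()
response = session.get(LINK)
response.html.render()  # runs the page's JavaScript (downloads Chromium on first use)
html = BeautifulSoup(response.html.html, "html.parser")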
Upvotes: 2
Reputation: 3448
Your matchscrape function is wrong. match.find returns only the first matching element. Use match.findAll instead, the same way you already do in the matches function, and then iterate over the datetimes it finds, as in the example below.
def matchscrape(g_data):
    for match in g_data:
        datetimes = match.findAll('div', class_='main time col-sm-2 hidden-xs')
        for datetime in datetimes:
            print("DateTimes; ", datetime.text.strip())
        print('-' * 80)
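The second thing is the parser. "lxml" is also an HTML parser, so it is not wrong here, but different parsers can build slightly different trees from the same markup. If you want to rule that out, or avoid the extra dependency, you can use the built-in parser with BeautifulSoup(page.text, 'html.parser') instead of lxml.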
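To make the difference between find and findAll concrete, here is a small self-contained sketch. The HTML is a made-up sample, not the site's real markup; it only reuses the class names from the question. It uses find_all, the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one listData block holding two time cells.
SAMPLE_HTML = """
<div class="listData">
  <div class="main time col-sm-2 hidden-xs">17:05</div>
  <div class="main time col-sm-2 hidden-xs">18:00</div>
</div>
"""

TIME_CLASS = "main time col-sm-2 hidden-xs"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
block = soup.find("div", class_="listData")

# find() stops at the first match inside the block.
first_only = block.find("div", class_=TIME_CLASS).text.strip()

# find_all() returns every match inside the block.
all_times = [d.text.strip() for d in block.find_all("div", class_=TIME_CLASS)]

print(first_only)
print(all_times)
```

If the HTML the server returns contains only one time cell, both calls give the same single result, so it is worth checking the raw response as well.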
Upvotes: 2
Reputation: 195438
There's only one time for today, but you can get the times for tomorrow by first making a POST request with the date you want and then reloading the page.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://sport-tv-guide.live/live/darts'
select_date_url = 'https://sport-tv-guide.live/ajaxdata/selectdate'

with requests.session() as s:
    # print times for today:
    print('Times for today:')
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    for t in soup.select('.time'):
        print(t.get_text(strip=True, separator=' '))

    # select tomorrow:
    s.post(select_date_url, data={'d': '2020-07-19'}).text

    # print times for tomorrow:
    print('Times for 2020-07-19:')
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    for t in soup.select('.time'):
        print(t.get_text(strip=True, separator=' '))
Prints:
Times for today:
Darts 17:05
Times for 2020-07-19:
Darts 19:05
Darts 19:05
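The date in the example is hard-coded. A minimal sketch of computing it instead, assuming the endpoint keeps accepting the 'd' field in YYYY-MM-DD form as shown above; the helper names select_date_payload and tomorrow are my own.

```python
from datetime import date, timedelta

def select_date_payload(day):
    """Build the form data for the select-date request: {'d': 'YYYY-MM-DD'}."""
    return {"d": day.isoformat()}

def tomorrow(today=None):
    """Return the day after `today` (defaults to the current local date)."""
    today = today or date.today()
    return today + timedelta(days=1)

payload = select_date_payload(tomorrow())
print(payload)
```

The payload can then be passed as s.post(select_date_url, data=payload) in place of the fixed date.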
Upvotes: 1