Brendan Rodgers

Reputation: 305

Python - Beautifulsoup - Only one result being returned

I am attempting to scrape sports schedule data from the link below:

https://sport-tv-guide.live/live/darts

I am using the following code:

import requests
from bs4 import BeautifulSoup

def makesoup(url):
    page = requests.get(url)
    return BeautifulSoup(page.text, "lxml")


def matchscrape(g_data):
    for match in g_data:
        datetimes = match.find('div', class_='main time col-sm-2 hidden-xs').text.strip()
        print("DateTimes; ", datetimes)
        print('-' * 80)


def matches():
    soup = makesoup(url="https://sport-tv-guide.live/live/darts")
    matchscrape(g_data=soup.findAll("div", {"class": "listData"}))

The issue is that only the first result is returned (see below):

Error output

whereas two values should be printed (see below):

Expected

I printed the HTML received when running

def matches():
    soup = makesoup(url="https://sport-tv-guide.live/live/darts")
    matchscrape(g_data=soup.findAll("div", {"class": "listData"}))

and only the first result is present in that HTML (see below). That explains why only the first result is printed: it is the only one that can be found in the HTML received. What I am unsure of is why Beautifulsoup is not giving me the whole HTML, so that all the results can be output.

errorhtml

Thanks to anyone who can assist or solve this issue.

Upvotes: 2

Views: 574

Answers (4)

Brendan Rodgers

Reputation: 305

With the help of the other answers, I was able to identify the issue: the site stores a cookie containing the countries the user has selected for showing sport schedule data. In this example there was a listing at 18:00 for a channel in Australia. It was not showing in the output of my code above because the request sent by the requests module did not include that cookie.

I was able to provide the necessary cookie information via the code below:

import requests
from bs4 import BeautifulSoup

def makesoup(url):
    cookies = {'mycountries': '101,28,3,102,42,10,18,4,2'}  # countries selected on the site
    r = requests.post(url, cookies=cookies)
    return BeautifulSoup(r.text, "html.parser")

and the correct information was then printed:

correctoutput

Just posting this answer in case it helps someone with a similar problem in the future.
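
For anyone adapting this, here is a minimal sketch of building the cookie from a list of country IDs. The `mycountries` cookie name and the comma-separated format come from the snippet above. The helper name `country_cookie` is hypothetical, and which ID corresponds to which country is not verified here.

```python
def country_cookie(country_ids):
    """Build the cookie dict for a list of numeric country IDs.

    Assumes the site expects a comma-separated string under the
    'mycountries' key, as in the snippet above.
    """
    return {"mycountries": ",".join(str(i) for i in country_ids)}


cookies = country_cookie([101, 28, 3, 102, 42, 10, 18, 4, 2])
print(cookies)

# Usage (performs a network request, so it is left commented out):
# import requests
# r = requests.post("https://sport-tv-guide.live/live/darts", cookies=cookies)
```

Keeping the IDs in a list makes it easier to add or remove a country than editing the string by hand.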

Upvotes: 1

thoerni

Reputation: 555

As @Ycmelon already answered, there is only one timestamp for me too. Still, something else might be causing the problem. Websites like this one often have dynamic content, and requests does not execute JavaScript, so that content can be missing from the response.

If you are sure the problem is that requests doesn't fetch the full page, try requests_html (pip install requests-html), which can render JavaScript-generated content when you call render() on the response's html object:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

LINK = "https://sport-tv-guide.live/live/darts"

session = HTMLSession()
response = session.get(LINK)
response.html.render()  # runs the page's JavaScript (downloads Chromium on first use)
html = BeautifulSoup(response.html.html, "html.parser")
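
Before switching libraries, one way to check whether requests is really missing content is to search the raw response for a value you can see in the browser. The sketch below uses a hypothetical helper (`missing_from_html` is not part of any library) and inline sample HTML rather than the live page.

```python
def missing_from_html(html, expected_values):
    """Return the expected values that do not appear in the raw HTML."""
    return [value for value in expected_values if value not in html]


# Sample markup standing in for page.text; the real page will differ.
sample_html = '<div class="listData"><div class="time">17:05</div></div>'

print(missing_from_html(sample_html, ["17:05", "18:00"]))  # ['18:00']
```

If a time shown in the browser is missing from `page.text`, the server sent different HTML (for example because of cookies or JavaScript), and changing the parser will not help.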

Upvotes: 2

Benjamin

Reputation: 3448

Your matchscrape function is wrong. match.find returns only the first matching element. Use match.findAll instead, as you already do in the matches function, and then iterate over the datetimes it finds, as in the example below.

def matchscrape(g_data):
    for match in g_data:
        datetimes = match.findAll('div', class_='main time col-sm-2 hidden-xs')
        for datetime in datetimes:
            print("DateTimes; ", datetime.text.strip())
            print('-' * 80)

The second point is the parser. lxml is also an HTML parser, so this is less likely to be the cause, but you could try BeautifulSoup(page.text, 'html.parser') instead to rule out parser differences.
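
The difference between the two calls can be seen on a small inline document. This is a sketch with made-up markup that reuses the class names from the question; it does not reproduce the real page structure.

```python
from bs4 import BeautifulSoup

# Made-up markup: one listData container holding two time cells.
html = """
<div class="listData">
  <div class="main time col-sm-2 hidden-xs">17:05</div>
  <div class="main time col-sm-2 hidden-xs">18:00</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="listData")

# find() returns only the first matching element.
first = container.find("div", class_="main time col-sm-2 hidden-xs")
# find_all() returns every matching element (findAll is an older alias).
all_times = container.find_all("div", class_="main time col-sm-2 hidden-xs")

print(first.text.strip())
print([cell.text.strip() for cell in all_times])
```

This only helps when a single container holds several time cells. If the HTML received contains just one, both calls give the same result.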

Upvotes: 2

Andrej Kesely

Reputation: 195438

There's only one time for today, but you can get the times for tomorrow by first making a POST request with the wanted date and then reloading the page.

For example:

import requests
from bs4 import BeautifulSoup


url = 'https://sport-tv-guide.live/live/darts'
select_date_url = 'https://sport-tv-guide.live/ajaxdata/selectdate'

with requests.session() as s:
    # print times for today:
    print('Times for today:')
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    for t in soup.select('.time'):
        print(t.get_text(strip=True, separator=' '))

    # select tomorrow:
    s.post(select_date_url, data={'d': '2020-07-19'})

    # print times for tomorrow:
    print('Times for 2020-07-19:')
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    for t in soup.select('.time'):
        print(t.get_text(strip=True, separator=' '))

Prints:

Times for today:
Darts 17:05
Times for 2020-07-19:
Darts 19:05
Darts 19:05
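
The hard-coded date can be computed instead. A minimal sketch, assuming the endpoint accepts the same `d` field in `YYYY-MM-DD` form as in the example above; the helper name `date_payload` is hypothetical.

```python
from datetime import date, timedelta


def date_payload(today, days_ahead=1):
    """Build the form data for selecting a date relative to `today`."""
    target = today + timedelta(days=days_ahead)
    # isoformat() yields YYYY-MM-DD, the format used in the example above.
    return {"d": target.isoformat()}


print(date_payload(date(2020, 7, 18)))  # {'d': '2020-07-19'}

# Usage inside the session from the example (network call, commented out):
# s.post(select_date_url, data=date_payload(date.today()))
```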

Upvotes: 1
