Reputation: 305
I am trying to retrieve the list of URLs on the following page:
https://sport-tv-guide.live/live/tennis
Once the URLs are gathered, I need to pass each one to a scrape function that scrapes and outputs the relevant match data.
The data is output correctly if there is only one match on the page, such as https://sport-tv-guide.live/live/darts (see output below).
The issue occurs when I use a page with more than one link, such as https://sport-tv-guide.live/live/tennis. The URLs appear to be scraped correctly (confirmed by printing them), but they don't seem to be passed correctly for the content to be scraped, as the script just fails silently (see output below).
The code is below:
import requests
from bs4 import BeautifulSoup

def makesoup(url):
    cookies = {'mycountries' : '101,28,3,102,42,10,18,4,2'}
    r = requests.post(url, cookies=cookies)
    return BeautifulSoup(r.text,"lxml")

def linkscrape(links):
    baseurl = "https://sport-tv-guide.live"
    urllist = []
    for link in links:
        finalurl = (baseurl+ link['href'])
        urllist.append(finalurl)
        # print(finalurl)
    for singleurl in urllist:
        soup2=makesoup(url=singleurl)
        print(singleurl)
        g_data=soup2.find_all('div', {'class': 'main col-md-4 eventData'})
        for match in g_data:
            hometeam = match.find('div', class_='cell40 text-center teamName1').text.strip()
            awayteam = match.find('div', class_='cell40 text-center teamName2').text.strip()
            dateandtime = match.find('div', class_='timeInfo').text.strip()
            print("Match ; " + hometeam + "vs" + awayteam)
            print("Date and Time; ", dateandtime)

def matches():
    soup=makesoup(url = "https://sport-tv-guide.live/live/tennis")
    linkscrape(links= soup.find_all('a', {'class': 'article flag', 'href' : True}))
I am assuming the issue is that when there is more than one URL, they are being passed as one large string rather than as separate URLs, but I am unsure how to pass only a single URL at a time from the list of URLs to be scraped.
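A quick way to test this assumption is to build the URL list in isolation. The sketch below is only an illustration: the href values are hypothetical stand-ins for the scraped `link['href']` values, `build_urls` is a made-up helper name, and `urljoin` from the standard library is swapped in for the string concatenation used in the script. If each printed line is a single URL, the list holds separate strings and the loop is already passing them one at a time.

```python
from urllib.parse import urljoin

BASE_URL = "https://sport-tv-guide.live"

def build_urls(hrefs, base=BASE_URL):
    """Return one absolute URL per relative href (hypothetical helper)."""
    return [urljoin(base, href) for href in hrefs]

# Hypothetical hrefs standing in for the link['href'] values from the page.
sample_hrefs = [
    "/event/bett1-aces-berlin/?uid=71916304",
    "/event/world-teamtennis/?uid=161707191630102",
]

urls = build_urls(sample_hrefs)
for url in urls:
    # Each iteration sees exactly one URL string.
    print(url)
```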
Thanks to anyone who can advise or help solve this issue.
Upvotes: 0
Views: 61
Reputation: 17368
After analysing the links, I found that the two listing pages link to event pages with different layouts.
https://sport-tv-guide.live/live/tennis - the links on this page point to event pages with one layout.
https://sport-tv-guide.live/live/darts - the links on this page point to event pages with a different layout (the one the original selectors match).
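This would also explain why the original script seems to fail silently: when a selector matches nothing, `find_all` returns an empty list, so the `for` loop body never runs and nothing is printed. A minimal sketch of that behaviour, assuming a made-up HTML fragment rather than the real page and using `html.parser` so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment whose layout lacks the class string the script expects.
html = "<div class='eventData'><div class='row'>Player A</div></div>"
soup = BeautifulSoup(html, "html.parser")

# Same selector as the original script: an exact class string that is absent here.
g_data = soup.find_all("div", {"class": "main col-md-4 eventData"})
print(len(g_data))  # 0

for match in g_data:
    # Never reached, because g_data is empty.
    print("match found")

if not g_data:
    print("No matching event block; the page probably uses another layout")
```

Checking for an empty result before looping turns the silent failure into a visible message.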
If you need to scrape the data from all the links from https://sport-tv-guide.live/live/tennis, the following script works.
import requests
from bs4 import BeautifulSoup

def makesoup(url):
    cookies = {'mycountries' : '101,28,3,102,42,10,18,4,2'}
    print(url)
    r = requests.post(url, cookies=cookies)
    return BeautifulSoup(r.text, "lxml")

def linkscrape(links):
    baseurl = "https://sport-tv-guide.live"
    urllist = []
    # Build one absolute URL per link.
    for link in links:
        finalurl = baseurl + link['href']
        urllist.append(finalurl)
    # Fetch and parse each event page separately.
    for singleurl in urllist:
        soup2 = makesoup(url=singleurl)
        g_data = soup2.find('div', {'class': 'eventData'})
        try:
            teams = g_data.find_all("div", class_=["row", "mb-5"])
            print("HomeTeam - {}".format(teams[0].find("div", class_="main col-md-8 col-wrap").text.strip()))
            print("AwayTeam - {}".format(teams[1].find("div", class_="main col-md-8 col-wrap").text.strip()))
            channelInfo = g_data.find("div", {"id": "channelInfo"})
            print("Time - {}".format(channelInfo.find("div", class_="time full").text.strip()))
            print("Date - {}".format(channelInfo.find("div", class_="date full").text.strip()))
        except (AttributeError, IndexError):
            # A missing element makes find() return None or leaves teams too short.
            print("Data not found")

def matches():
    soup = makesoup(url="https://sport-tv-guide.live/live/tennis")
    linkscrape(links=soup.find_all('a', {'class': 'article flag', 'href' : True}))

matches()
Note: I wrapped the extraction in try/except because the pages that the links point to do not all share the same layout.
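If you would rather see which field is missing instead of a single "Data not found", one option is to check each lookup for `None`. The following is only a sketch under stated assumptions: `text_or_none` and `parse_event` are hypothetical helper names, the HTML fragments are made up to imitate the two layouts, and `html.parser` is used so the example does not need lxml.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, name, **kwargs):
    """Return the stripped text of the first matching tag, or None if absent."""
    if parent is None:
        return None
    tag = parent.find(name, **kwargs)
    return tag.get_text(strip=True) if tag is not None else None

def parse_event(html):
    """Extract time and date from one event page; missing fields stay None."""
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find("div", id="channelInfo")
    return {
        "time": text_or_none(info, "div", class_="time full"),
        "date": text_or_none(info, "div", class_="date full"),
    }

# Hypothetical fragments imitating a page with event data and one without.
full_page = (
    '<div id="channelInfo">'
    '<div class="time full">11:15</div>'
    '<div class="date full">Sunday, 07-19-2020</div>'
    '</div>'
)
other_page = '<div class="somethingElse">no event data</div>'

print(parse_event(full_page))
print(parse_event(other_page))
```

With this approach a page in the other layout yields `None` values instead of raising, so the caller can decide whether to skip it or try a second set of selectors.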
Output:
https://sport-tv-guide.live/live/tennis
https://sport-tv-guide.live/event/live-tennis-national-tennis-centre-roehampton?uid=191007191100
Data not found
https://sport-tv-guide.live/event/bett1-aces-berlin/?uid=71916304
HomeTeam - Tommy Haas - Roberto Bautista-Agut
AwayTeam - Dominic Thiem - Jannik Sinner
Time - 11:15
Date - Sunday, 07-19-2020
https://sport-tv-guide.live/event/bett1-aces-berlin/?uid=71916307
HomeTeam - Tommy Haas - Roberto Bautista-Agut
AwayTeam - Dominic Thiem - Jannik Sinner
Time - 14:00
Date - Sunday, 07-19-2020
https://sport-tv-guide.live/event/bett1-aces-berlin/?uid=17207191605
HomeTeam - Tommy Haas - Roberto Bautista-Agut
AwayTeam - Dominic Thiem - Jannik Sinner
Time - 14:05
Date - Sunday, 07-19-2020
https://sport-tv-guide.live/event/world-teamtennis/?uid=161707191630102
Data not found
Upvotes: 1