Reputation:
I have a code to collect all of the URLs from the "oddsportal" website for a page:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
source = requests.get("https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/",headers=headers)
soup = BeautifulSoup(source.text, 'html.parser')
main_div=soup.find("div",class_="main-menu2 main-menu-gray")
a_tag=main_div.find_all("a")
for i in a_tag:
print(i['href'])
which returns these results:
/soccer/africa/africa-cup-of-nations/results/
/soccer/africa/africa-cup-of-nations-2019/results/
/soccer/africa/africa-cup-of-nations-2017/results/
/soccer/africa/africa-cup-of-nations-2015/results/
/soccer/africa/africa-cup-of-nations-2013/results/
/soccer/africa/africa-cup-of-nations-2012/results/
/soccer/africa/africa-cup-of-nations-2010/results/
/soccer/africa/africa-cup-of-nations-2008/results/
I would like the URLs to be returned as:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/3/
for all the parent urls generated for results
.
I can see that the urls can be appended as seen from inspect element as below for div id = "pagination"
Upvotes: 1
Views: 216
Reputation: 20018
The data under id="pagination"
is loaded dynamically, so requests
won't support it.
However, you can get the table of all those pages (1-3) via sending a GET
request to:
https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={timestampe}"
where {page}
is corresponding to the page number (1-3) and {timestampe}
is the current time
You'll also need to add:
"Referer": "https://www.oddsportal.com/"
to your headers
.
also, use the lxml
parser instead of html.parser
to avoid a RecursionError
.
import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup
headers = {
"User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
"Referer": "https://www.oddsportal.com/",
}
with requests.Session() as session:
session.headers.update(headers)
for page in range(1, 4):
response = session.get(
f"https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={datetime.now().timestamp()}"
)
table_data = re.search(r'{"html":"(.*)"}', response.text).group(1)
soup = BeautifulSoup(table_data, "lxml")
print(soup.prettify())
Upvotes: 2