user16304089

How do I get the URLs for all the pages?

I have code that collects all of the URLs from the "oddsportal" website for a single page:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
source = requests.get("https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/",headers=headers)

soup = BeautifulSoup(source.text, 'html.parser')
main_div=soup.find("div",class_="main-menu2 main-menu-gray")
a_tag=main_div.find_all("a")
for i in a_tag:
    print(i['href'])

which returns these results:

/soccer/africa/africa-cup-of-nations/results/
/soccer/africa/africa-cup-of-nations-2019/results/
/soccer/africa/africa-cup-of-nations-2017/results/
/soccer/africa/africa-cup-of-nations-2015/results/
/soccer/africa/africa-cup-of-nations-2013/results/
/soccer/africa/africa-cup-of-nations-2012/results/
/soccer/africa/africa-cup-of-nations-2010/results/
/soccer/africa/africa-cup-of-nations-2008/results/

I would like the URLs to be returned as:

https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/3/

and similarly for all the parent result URLs.
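For the URL-building part alone, the desired output can be sketched without scraping: join each relative `href` to the site root with `urljoin`, then append the `#/page/N/` fragments. This is only a sketch, assuming three pages per tournament (the real page count would have to be read from the `pagination` div):

```python
from urllib.parse import urljoin

BASE = "https://www.oddsportal.com"

def page_urls(href, pages=3):
    """Build the absolute results URL plus its #/page/N/ variants.

    `pages` is an assumption here; on the live site it should come
    from the links under div id="pagination".
    """
    absolute = urljoin(BASE, href)          # relative href -> absolute URL
    urls = [absolute]                       # page 1 has no fragment
    for n in range(2, pages + 1):
        urls.append(f"{absolute}#/page/{n}/")
    return urls

for url in page_urls("/soccer/africa/africa-cup-of-nations/results/"):
    print(url)
```

Note, though, that these `#/page/N/` fragments are handled client-side, so fetching them with `requests` returns the same HTML as page 1; the answer below works around that.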

I can see from Inspect Element that the page URLs can be built from the links under the div with id="pagination":

[Screenshot: Inspect Element view of div id="pagination"]

Upvotes: 1

Views: 216

Answers (1)

MendelG

Reputation: 20018

The data under id="pagination" is loaded dynamically with JavaScript, so requests alone won't see it.

However, you can get the table for all those pages (1-3) by sending a GET request to:

https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={timestamp}

where {page} corresponds to the page number (1-3) and {timestamp} is the current Unix timestamp.

You'll also need to add:

"Referer": "https://www.oddsportal.com/"

to your headers. Also, use the lxml parser instead of html.parser to avoid a RecursionError.

import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup

headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "Referer": "https://www.oddsportal.com/",
}


with requests.Session() as session:
    session.headers.update(headers)
    for page in range(1, 4):
        # Request each page of the archive; the trailing timestamp
        # acts as a cache-buster.
        response = session.get(
            f"https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={datetime.now().timestamp()}"
        )

        # The response is JSON-like text; pull the HTML table
        # out of the "html" field before parsing it.
        table_data = re.search(r'{"html":"(.*)"}', response.text).group(1)
        soup = BeautifulSoup(table_data, "lxml")
        print(soup.prettify())

Upvotes: 2
