bobman

Reputation: 123

Scraping a dynamic URL that changes based on time using Selenium in Python

I am attempting to scrape the following URL:

https://www.oddsportal.com/soccer/england/premier-league/liverpool-norwich-4IMoMG3q/

Using the Network tab in Chrome's developer tools, you can see that the page is fed by an API that returns JSON data, requested from a URL like the one below. This data is what I am trying to scrape.

https://fb.oddsportal.com/feed/match/1-1-4IMoMG3q-5-2-yj1e3.dat?_=1562831112277

This is the code I am currently trying to scrape this with:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json
import urllib.parse
from time import time

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

# Access the initial webpage to create the info_dict (including the match_id, and hash)
driver.get('https://www.oddsportal.com/soccer/england/premier-league/liverpool-norwich-4IMoMG3q')
page = driver.page_source
info_dict = json.loads(page.split('var page = new PageEvent(')[-1].split(');')[0])
xhash = urllib.parse.unquote(info_dict['xhash'])
match_id = info_dict['id']

# Access the feed URL built from the values in info_dict
driver.get('http://fb.oddsportal.com/feed/match/1-1-{}-1-2-{}.dat?_={}'.format(match_id, xhash, int(round(time()*1000)) + 1000))
print(driver.page_source)

The URL is built from three components: the match_id, the hash, and the epoch time in milliseconds. However, when I try to access it with Selenium, I get the following response:

globals.jsonpCallback('/feed/match/1-1-4IMoMG3q-1-2-yjb3a.dat?_=1562795864899', {'e':'404'});

Would really appreciate any help with this, as I don't really understand where I'm going wrong!

Upvotes: 2

Views: 1078

Answers (2)

leh

Reputation: 53

I don't know how, but you are parsing the wrong xhash.

If you parse the Liverpool-Norwich page, you can see that the xhash is '%79%6a%65%61%31'. Decoding it gives 'yjea1', which is the value that belongs in your URL.
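As a rough sketch of that decoding step, assuming the encoded value quoted above and the same URL format string as in the question's code (the variable names here are only illustrative):

```python
from time import time
from urllib.parse import unquote

# Percent-encoded xhash as quoted in this answer
encoded_xhash = '%79%6a%65%61%31'
xhash = unquote(encoded_xhash)
print(xhash)  # yjea1

# Rebuild the feed URL with the same format string the question uses
match_id = '4IMoMG3q'
timestamp_ms = int(time() * 1000)  # epoch time in milliseconds, generated at call time
feed_url = 'http://fb.oddsportal.com/feed/match/1-1-{}-1-2-{}.dat?_={}'.format(
    match_id, xhash, timestamp_ms)
print(feed_url)
```

Comparing the printed hash with the one in the URL from the Network tab is a quick way to check that the right value was parsed.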

With your code and the right xhash, I get all the odds you are looking for!

Cheers

Upvotes: 1

leh

Reputation: 53

I know this was asked a long time ago, but it may help someone else one day.

As explained and resolved here, the last part of your URL, after dat?_=, is computed from the current date and time, and becomes obsolete some time afterwards.

If you generate it at the moment you make the call, you'll get the data. For instance, if you want the games of the French Ligue 1 for the 2018-2019 season, a raw version of the code could be the following (you still need to parse page.text properly):

import datetime

import requests


def timestamp_date():
    # Current time as epoch milliseconds, generated at call time
    return int(datetime.datetime.now().timestamp() * 1000)


url = 'https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/Gji6p9u4/X0/1/0/8/?_=' + str(timestamp_date())
headers = {
    'User-Agent': 'curl/7.64.0',
    'Referer': 'https://www.oddsportal.com/soccer/france/ligue-1-2018-2019/results/',
}
page = requests.get(url, headers=headers)
print(page.text)  # raw response text; still needs to be parsed
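For the parsing part, here is a minimal sketch. It assumes the response is wrapped the same way as the 404 response shown in the question, i.e. callback('path', payload); and I have not verified that every endpoint uses that shape. The parse_jsonp helper is hypothetical, not part of any library:

```python
import ast
import json
import re

# Assumed shape, taken from the response in the question: callback('path', payload);
JSONP_RE = re.compile(
    r"^\s*[\w.]+\(\s*'(?P<path>[^']*)'\s*,\s*(?P<payload>.*)\)\s*;?\s*$",
    re.DOTALL,
)


def parse_jsonp(text):
    """Split a callback('path', payload); string into (path, payload data)."""
    match = JSONP_RE.match(text)
    if match is None:
        raise ValueError('response is not in the expected callback form')
    payload = match.group('payload')
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        # The 404 payload in the question uses single quotes, which is not
        # strict JSON, so fall back to a safe Python literal parse.
        data = ast.literal_eval(payload)
    return match.group('path'), data


sample = ("globals.jsonpCallback("
          "'/feed/match/1-1-4IMoMG3q-1-2-yjb3a.dat?_=1562795864899', {'e':'404'});")
path, data = parse_jsonp(sample)
print(path)
print(data)
```

If the payload comes back with an 'e' key like in the sample, the request was rejected, so it is worth checking for that before looking for odds.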

Upvotes: 3
