Davide Tessarollo

Reputation: 87

Python Selenium hard webscraping

The website is https://www.jao.eu/auctions#/

The page has an 'OUT AREA' dropdown (the markup contains a lot of ReactSelect elements).

I need to get the full list of items contained in that dropdown [AT, BDL-GB, BDL-NL, BE...].

Can you please help me? This is what I have so far:

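from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()  # assumption: the question does not say which browser is used
wait = WebDriverWait(driver, 20)
driver.get('https://www.jao.eu/auctions#/')

# Wait for the React Select control to become visible, then click it to open the dropdown
first = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.css-1739xgv-control')))
first.click()

second = wait.until(......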

Upvotes: 2

Views: 122

Answers (2)

SIM

Reputation: 22440

Try the following to fetch the required list of items from that site using the requests module:

import requests

link = 'https://www.jao.eu/api/v1/auction/calls/getcorridors'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.post(link, json={})
    items = [item['value'] for item in res.json()]
    print(items)

The output looks like this (truncated):

'IT-CH', 'HU-SK', 'ES-PT', 'FR-IT', 'SK-CZ', 'NL-DK', 'IT-FR', 'HU-HR'
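
The snippet above assumes that the request succeeds and that every entry in the response carries a 'value' key. A slightly more defensive variant might look like the sketch below. The helper names extract_values and fetch_corridors and the 10-second timeout are illustrative choices, not part of the original answer, and the shape of the response (a JSON list of objects with a 'value' field) is inferred from the code above.

```python
URL = "https://www.jao.eu/api/v1/auction/calls/getcorridors"


def extract_values(payload):
    """Return the 'value' field of every dict entry that has one."""
    return [
        entry["value"]
        for entry in payload
        if isinstance(entry, dict) and "value" in entry
    ]


def fetch_corridors(timeout=10):
    """POST an empty JSON body to the endpoint and return the corridor names."""
    # Imported here so that extract_values can be used without the dependency.
    import requests

    headers = {"User-Agent": "Mozilla/5.0"}
    # timeout (hypothetical value) keeps the call from hanging indefinitely
    response = requests.post(URL, headers=headers, json={}, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return extract_values(response.json())
```

Calling fetch_corridors() performs the request; extract_values can be tried offline on any list of dicts.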

Upvotes: 1

Paul M.

Reputation: 10809

Logging one's network traffic reveals that the page makes several requests to REST APIs. One of the endpoints is getcorridors, whose response is JSON and contains all values from the dropdown(s). All you need to do is imitate that HTTP POST request. No Selenium required:

def get_corridors():
    import requests
    from operator import itemgetter

    url = "https://www.jao.eu/api/v1/auction/calls/getcorridors"

    headers = {
        "Accept": "application/json",
        "Accept-Encoding": "gzip, deflate",
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0"
    }

    response = requests.post(url, headers=headers, json={})
    response.raise_for_status()

    return list(map(itemgetter("value"), response.json()))
    

def main():

    for corridor in get_corridors():
        print(corridor)
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

IT-CH
HU-SK
ES-PT
FR-IT
SK-CZ
NL-DK
IT-FR
HU-HR
FR-ES
IT-GR
CZ-AT
DK-NL
SI-AT
CH-DE
...
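
The values above are corridors rather than single areas, while the question asks for the 'OUT AREA' list. One possible way to derive it is sketched below, under two loud assumptions that the source does not confirm: each corridor has the form OUT-IN, and the part before the hyphen is the out area. Area codes that themselves contain a hyphen (such as BDL-GB from the question) make the split ambiguous, so the hypothetical helper split_corridors sets those values aside instead of guessing.

```python
def split_corridors(corridors):
    """Return (sorted unique out areas, corridors that could not be split safely)."""
    out_areas = set()
    ambiguous = []
    for corridor in corridors:
        parts = corridor.split("-")
        if len(parts) == 2 and all(parts):
            # Assumption: the first part of 'OUT-IN' is the out area.
            out_areas.add(parts[0])
        else:
            # Zero or several hyphens: the boundary is unclear, so do not guess.
            ambiguous.append(corridor)
    return sorted(out_areas), ambiguous


# Usage with a few of the corridor values printed above
areas, skipped = split_corridors(["IT-CH", "HU-SK", "ES-PT", "FR-IT", "IT-FR"])
print(areas)
print(skipped)
```

Anything that ends up in the second list would need to be resolved by hand or against the dropdown itself.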

Upvotes: 2
