Reputation: 621

Webscraping for all the combination of dropdown extract values

I am trying to extract the price and the h1 tag text from cars.com. The page is searched through a set of dropdown lists.

I want to select different models and look up the price for each, but the 'model' selection depends on 'make'. I can already collect every dropdown combination using Selenium. For each combination, how can I get the h1 information, e.g. with soup.find("h1")?

The code is as follows:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Chrome('C:/Users/chromedriver.exe')
driver.get('https://www.cars.com/')
time.sleep(4)

selectMake = Select(driver.find_element_by_name("makeId"))


time.sleep(2)


selectModel = Select(driver.find_element_by_name("modelId"))

data = []
for makesOption in selectMake.options:
    makesText = makesOption.text
    selectMake.select_by_visible_text(makesText)
    time.sleep(1)
    selectModel = Select(driver.find_element_by_name("modelId"))
    for modelOption in selectModel.options:
        modelText = modelOption.text
        selectModel.select_by_visible_text(modelText)
        data.append([makesText,modelText])

Upvotes: 2

Answers (2)

Pedro Lobito

Reputation: 98871

All the information you need is contained in JSON objects in the source code of each page. Luckily, JavaScript isn't needed to retrieve them, so you don't need Selenium, which is slow by nature. You can simply use requests to fetch the page, extract the JSON object, and convert it to a Python object, i.e.:

import requests, re, json

x = requests.get("https://cars.com")
if x.status_code == 200:
    js_obj = re.findall("REDUX_STATE = (.*?)</script>", x.text, re.IGNORECASE | re.MULTILINE)
    if js_obj:
        j_obj = json.loads(js_obj[0]) # check the tree view of the object on notes
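
To see the extraction step in isolation, here is a minimal sketch that runs against a made-up HTML snippet rather than the live site. The sample markup and the extract_state helper name are assumptions for illustration, and the real cars.com page may be shaped differently. Only the "REDUX_STATE = ...</script>" pattern and the home/makeModels/models keys come from this answer.

```python
import json
import re

# Hypothetical page source; the real cars.com markup may differ.
SAMPLE_HTML = """
<html><body>
<script>window.REDUX_STATE = {"home": {"makeModels": {"models": [
  {"makeId": 20001, "id": 20773, "label": "CL", "name": "CL"}
]}}}</script>
</body></html>
"""


def extract_state(html):
    """Return the embedded JSON object as a dict, or None if it is absent."""
    # re.DOTALL lets '.' span newlines in case the JSON is pretty-printed.
    match = re.search(r"REDUX_STATE = (.*?)</script>", html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


state = extract_state(SAMPLE_HTML)
models = state["home"]["makeModels"]["models"]
print(models[0]["makeId"], models[0]["id"])
```

Returning None when the pattern is missing lets the caller decide whether to skip the page or retry, instead of failing with an IndexError.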

The JSON object on the main page of cars.com contains all makes and models, which are identified by IDs, i.e.:

mkId = Make ID
mdId = Model ID

With this information we can construct search queries for all makes and models:

https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=20773&mkId=20001&page=1&perPage=100
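
As a small sketch of how such a query string could be assembled, the snippet below builds the same URL with urllib.parse.urlencode instead of string formatting. The build_search_url name and its default arguments are hypothetical; the parameter names and values are taken from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.cars.com/for-sale/searchresults.action/"


def build_search_url(mk_id, md_id, page=1, per_page=100):
    """Build a search URL for one make/model pair."""
    # Dict order is preserved, so the parameters appear in this order.
    params = {
        "dealerType": "all",
        "mdId": md_id,
        "mkId": mk_id,
        "page": page,
        "perPage": per_page,
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url(20001, 20773))
```

Using urlencode keeps the query correctly escaped if a value ever contains characters such as spaces or ampersands.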

cars.com.py

import requests, re, json

x = requests.get("https://cars.com")
if x.status_code == 200:
    js_obj = re.findall("REDUX_STATE = (.*?)</script>", x.text, re.IGNORECASE | re.MULTILINE)
    if js_obj:
        j_obj = json.loads(js_obj[0])
        for model in j_obj['home']['makeModels']['models'][:1]: # remove [:1] to parse all makes and models
            mkId = model['makeId']
            mdId = model['id']
            label = model['label']
            name = model['name']

            #print(mkId, mdId, label, name)
            # 20001 20773 CL CL

            s_url = f"https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId={mdId}&mkId={mkId}&page=1&perPage=100"

            s_page = requests.get(s_url)
            if s_page.status_code == 200:

                s_html = re.findall(r"CARS\.digitalData = (.*?);\s+</script>", s_page.text, re.IGNORECASE | re.MULTILINE)
                if s_html:
                    s_obj = json.loads(s_html[0])
                    if "page" in s_obj:
                        if "vehicle" in s_obj['page']:
                            for v in s_obj["page"]['vehicle']:
                                v_price = v['price']
                                v_make = v['make']
                                v_mileage = v['mileage']
                                #...
                                print(v_make, name, v_price, v_mileage)

Output: (v_make, name, v_price, v_mileage)

Acura CL 8800 43000
Acura CL 3999 116577
Acura CL 6995 62382
Acura CL 6987 63871
Acura CL 6777 136676
Acura CL 1995 172911
Acura CL 2995 170234
Acura CL 3989 240799
Acura CL 1999 124000
Acura CL 4998 39322
Acura CL 3175 105200
Acura CL 6995 129558
Acura CL 4000 153000
Acura CL 2295 147056
Acura CL 4800 156000
Acura CL 3995 170558
Acura CL 1500 197000
Acura CL 1750 177392
Acura CL 1750 133094
Acura CL 3999 140618
Acura CL 2500 175600
Acura CL 1100 240000
Acura CL 5950 115055
Acura CL 1995 93419
Acura CL None 167456
Acura CL 5900 93500
Acura CL 3444 193000
Acura CL 3900 161756
Acura CL None 125231
Acura CL 3150 202201
Acura CL 5998 130017
Acura CL 5000 158955
Acura CL 3288 0
Acura CL 3300 153713
Acura CL None 202147

Notes:

Python DEMO
The script still needs error handling.
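
One possible direction for that error handling, sketched under assumptions: the dig and priced_vehicles helpers below are hypothetical names, and the sample dictionary only imitates the page/vehicle structure used in the script. The idea is to tolerate missing keys and to skip listings whose price is None, as seen in the output above.

```python
def dig(obj, *keys, default=None):
    """Walk nested dicts, returning default as soon as a key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def priced_vehicles(s_obj):
    """Return (make, price, mileage) tuples, skipping listings without a price."""
    rows = []
    # "or []" also covers the case where the key exists but holds None.
    for v in dig(s_obj, "page", "vehicle") or []:
        price = v.get("price")
        if price is None:
            continue
        rows.append((v.get("make"), price, v.get("mileage")))
    return rows


# Made-up object shaped like the one parsed from CARS.digitalData.
sample = {"page": {"vehicle": [
    {"make": "Acura", "price": 8800, "mileage": 43000},
    {"make": "Acura", "price": None, "mileage": 167456},
]}}
print(priced_vehicles(sample))
print(priced_vehicles({}))
```

Network failures would still need separate handling, for example by catching requests.exceptions.RequestException around each requests.get call.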

Upvotes: 1

Sers

Reputation: 12255

You initialize Select but never select anything. The Selenium documentation has detailed information on how to use Select.

Using WebDriverWait, you can wait for a specific condition on an element. In the code below, instead of sleep I use wait.until(EC.element_to_be_clickable((By.NAME, "makeId"))): Selenium checks whether the element is clickable every 0.5 seconds, with a timeout of 10 seconds, and moves on as soon as the element meets the clickable condition.

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('C:/Users/chromedriver.exe')
wait = WebDriverWait(driver, 10)

driver.get('https://www.cars.com/')

select_make = Select(wait.until(EC.element_to_be_clickable((By.NAME, "makeId"))))
select_make.select_by_visible_text("BMW")

select_model = Select(wait.until(EC.element_to_be_clickable((By.NAME, "modelId"))))
select_model.select_by_visible_text("- M850 Gran Coupe")
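
To make the polling behavior described above concrete, here is a rough pure-Python sketch of a poll-until-timeout loop. It is not Selenium's actual implementation: wait_until is a hypothetical helper, it raises the built-in TimeoutError rather than Selenium's TimeoutException, and the "clickable" check is replaced by a stand-in function.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "element is clickable": becomes true on the third check.
calls = {"count": 0}


def ready_on_third_check():
    calls["count"] += 1
    return calls["count"] >= 3


print(wait_until(ready_on_third_check, timeout=2.0, poll_interval=0.01))
```

Compared with a fixed sleep, this style of wait returns as soon as the condition holds and only spends the full timeout when something is actually wrong.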

Upvotes: 1
