Reputation: 621
I am trying to extract the price and the h1 tag text from cars.com. The page has dropdown lists for searching.
I want to select different models and search for their prices, but the 'model' selection depends on the 'make'. I already collect all dropdown combinations using Selenium. For each combination, how can I get the h1 text, with something like soup.find("H1")?
My code is as follows:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Chrome('C:/Users/chromedriver.exe')
driver.get('https://www.cars.com/')
time.sleep(4)
selectMake = Select(driver.find_element_by_name("makeId"))
time.sleep(2)
selectModel = Select(driver.find_element_by_name("modelId"))
data = []
for makesOption in selectMake.options:
    makesText = makesOption.text
    selectMake.select_by_visible_text(makesText)
    time.sleep(1)
    selectModel = Select(driver.find_element_by_name("modelId"))
    for modelOption in selectModel.options:
        modelText = modelOption.text
        selectModel.select_by_visible_text(modelText)
        data.append([makesText, modelText])
Upvotes: 2
Views: 684
Reputation: 98871
All the information you need is contained in JSON objects in the source code of each page. Luckily, JavaScript isn't needed to retrieve them, so you don't need Selenium, which is slow by nature. You can simply use requests to retrieve the JSON object and convert it to a Python object, e.g.:
import requests, re, json

x = requests.get("https://cars.com")
if x.status_code == 200:
    js_obj = re.findall("REDUX_STATE = (.*?)</script>", x.text, re.IGNORECASE | re.MULTILINE)
    if js_obj:
        j_obj = json.loads(js_obj[0])  # check the tree view of the object on notes
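As a minimal offline sketch of the extraction step above, the snippet below applies the same kind of regular expression to a small hand-written HTML string. The sample markup is made up for illustration and may not match the real cars.com page, and the helper name extract_embedded_json is hypothetical.

```python
import json
import re

# Made-up sample page; the real cars.com markup may differ.
SAMPLE_HTML = """
<html><body>
<script>window.REDUX_STATE = {"home": {"makeModels": {"models": [{"id": 20773, "makeId": 20001, "name": "CL"}]}}}</script>
</body></html>
"""


def extract_embedded_json(html, marker="REDUX_STATE"):
    """Return the JSON object assigned to `marker` inside a <script> tag, or None."""
    pattern = re.escape(marker) + r"\s*=\s*(.*?)\s*;?\s*</script>"
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The captured text was not valid JSON (for example, the page layout changed).
        return None


state = extract_embedded_json(SAMPLE_HTML)
print(state["home"]["makeModels"]["models"][0]["name"])
```

Returning None instead of raising keeps the caller's check as simple as the `if js_obj:` test in the answer's code.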
The JSON object on the main page of cars.com contains all makes and models, which are represented by IDs:
mkId = Make ID
mdId = Model ID
With this information we can construct search queries for all makes and models:
https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=20773&mkId=20001&page=1&perPage=100
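One way to build such a query without hand-formatting the string is urllib.parse.urlencode from the standard library. This is only a sketch: the helper name build_search_url is hypothetical, and the parameter names are simply the ones visible in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.cars.com/for-sale/searchresults.action/"


def build_search_url(mk_id, md_id, page=1, per_page=100):
    """Build a search URL for one make/model pair (parameter names taken from the URL above)."""
    params = {
        "dealerType": "all",
        "mdId": md_id,
        "mkId": mk_id,
        "page": page,
        "perPage": per_page,
    }
    return BASE_URL + "?" + urlencode(params)


print(build_search_url(20001, 20773))
```

urlencode also escapes any special characters in the values, which a plain f-string would not do.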
cars.com.py
import requests, re, json

x = requests.get("https://cars.com")
if x.status_code == 200:
    js_obj = re.findall("REDUX_STATE = (.*?)</script>", x.text, re.IGNORECASE | re.MULTILINE)
    if js_obj:
        j_obj = json.loads(js_obj[0])
        for model in j_obj['home']['makeModels']['models'][:1]:  # remove [:1] to parse all makes and models
            mkId = model['makeId']
            mdId = model['id']
            label = model['label']
            name = model['name']
            #print(mkId, mdId, label, name)
            # 20001 20773 CL CL
            s_url = f"https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId={mdId}&mkId={mkId}&page=1&perPage=100"
            s_page = requests.get(s_url)
            if s_page.status_code == 200:
                s_html = re.findall(r"CARS\.digitalData = (.*?);\s+</script>", s_page.text, re.IGNORECASE | re.MULTILINE)
                if s_html:
                    s_obj = json.loads(s_html[0])
                    if "page" in s_obj:
                        if "vehicle" in s_obj['page']:
                            for v in s_obj["page"]['vehicle']:
                                v_price = v['price']
                                v_make = v['make']
                                v_mileage = v['mileage']
                                #...
                                print(v_make, name, v_price, v_mileage)
Output: (v_make, name, v_price, v_mileage)
Acura CL 8800 43000
Acura CL 3999 116577
Acura CL 6995 62382
Acura CL 6987 63871
Acura CL 6777 136676
Acura CL 1995 172911
Acura CL 2995 170234
Acura CL 3989 240799
Acura CL 1999 124000
Acura CL 4998 39322
Acura CL 3175 105200
Acura CL 6995 129558
Acura CL 4000 153000
Acura CL 2295 147056
Acura CL 4800 156000
Acura CL 3995 170558
Acura CL 1500 197000
Acura CL 1750 177392
Acura CL 1750 133094
Acura CL 3999 140618
Acura CL 2500 175600
Acura CL 1100 240000
Acura CL 5950 115055
Acura CL 1995 93419
Acura CL None 167456
Acura CL 5900 93500
Acura CL 3444 193000
Acura CL 3900 161756
Acura CL None 125231
Acura CL 3150 202201
Acura CL 5998 130017
Acura CL 5000 158955
Acura CL 3288 0
Acura CL 3300 153713
Acura CL None 202147
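Several rows in the output above have a price of None. A minimal sketch of how such records could be skipped and the rest sorted by price follows. The three sample records reuse values from the output, the helper name priced_vehicles is hypothetical, and the sketch assumes prices arrive as numbers, which the output alone does not confirm.

```python
# A few records in the shape the script reads (values copied from the output above).
# Assumption: prices are numeric; if the site returns strings, convert them first.
vehicles = [
    {"make": "Acura", "price": 8800, "mileage": 43000},
    {"make": "Acura", "price": None, "mileage": 167456},
    {"make": "Acura", "price": 1100, "mileage": 240000},
]


def priced_vehicles(records):
    """Return only the records that have a price, cheapest first."""
    priced = [r for r in records if r.get("price") is not None]
    return sorted(priced, key=lambda r: r["price"])


for v in priced_vehicles(vehicles):
    print(v["make"], v["price"], v["mileage"])
```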
Notes:
Upvotes: 1
Reputation: 12255
You initialize Select but do not select anything; detailed information on how to use Select can be found here.
Using WebDriverWait you can wait for a specific condition on an element. In the code below, instead of sleep I used wait.until(EC.element_to_be_clickable((By.NAME, "makeId"))). Selenium checks every 0.5 seconds whether the element is clickable, with a timeout of 10 seconds, and moves on as soon as the element meets the clickable criteria.
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome('C:/Users/chromedriver.exe')
wait = WebDriverWait(driver, 10)
driver.get('https://www.cars.com/')
select_make = Select(wait.until(EC.element_to_be_clickable((By.NAME, "makeId"))))
select_make.select_by_visible_text("BMW")
select_model = Select(wait.until(EC.element_to_be_clickable((By.NAME, "modelId"))))
select_model.select_by_visible_text("- M850 Gran Coupe")
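To illustrate the polling idea behind the waits used above, here is a plain-Python sketch that needs no browser. It is not Selenium's implementation: the function wait_until and the stand-in condition are hypothetical, and it raises the built-in TimeoutError rather than Selenium's own exception.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in condition that becomes true on the third check.
calls = {"count": 0}


def ready_after_three_checks():
    calls["count"] += 1
    return calls["count"] >= 3


print(wait_until(ready_after_three_checks, timeout=2.0, poll_interval=0.1))
```

Unlike a fixed sleep, the loop returns as soon as the condition holds, so the script only waits as long as it has to.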
Upvotes: 1