SN33DS

Reputation: 35

Python html parsing partial class names

I am trying to parse a webpage with bs4, but the elements I am trying to access all have different class names, for example class='list-item listing … id-12984' and class='list-item listing … id-10359'.

import requests
from bs4 import BeautifulSoup

def preownedaston(url):
    preownedaston_resp = requests.get(url)

    if preownedaston_resp.status_code == 200:
        bs = BeautifulSoup(preownedaston_resp.text, 'lxml')
        posts = bs.find_all('div', class_='') #don't know what to put here
        for p in posts:
            title_year = p.find('div', class_='inset').find('a').find('span', class_='model_year').text
            print(title_year)

preownedaston('https://preowned.astonmartin.com/preowned-cars/search/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&postcode-area=United%20Kingdom&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760')

Is there a way to parse a partial class name like class_='list-item '?

Upvotes: 1

Views: 984

Answers (2)

Martin Evans

Reputation: 46759

The information on this page is actually loaded from a separate URL that returns JSON, which means you can easily extract the fields you want. For example:

import requests

url = "https://preowned.astonmartin.com/ajax/stock-listing/get-items/pageId/3760/ratio/3_2/taxBandImageLink/aHR0cHM6Ly9kMnBwMTFwZ29wNWY2cC5jbG91ZGZyb250Lm5ldC9UYXhCYW5kLSV0YXhfYmFuZCUuanBn/taxBandImageHyperlink/JWRlYWxlcl9lbWFpbCU=/imgWidth/767/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760"

r = requests.get(url)
data = r.json()
details = ['make', 'mileage', 'model', 'model_year', 'mpg', 'exterior_colour', 'price_now']

for vehicle in data['vehicles']:
    print()
    for key in details:
        print(f"{key:18} : {vehicle[key]}")

This displays the following:

make               : Aston Martin
mileage            : 42,000 km
model              : V12 Vantage
model_year         : 2011
mpg                : 17.3
exterior_colour    : Carbon Black
price_now          : €114,900

make               : Aston Martin
mileage            : 42,000 km
model              : V12 Vantage
model_year         : 2011
mpg                : 17.3
exterior_colour    : Carbon Black
price_now          : €99,900

Note: it might be necessary to add a User-Agent request header if the data is not returned. If you print data, you can see all of the available information for each vehicle.
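As a rough sketch of the note above, the request could carry a browser-like User-Agent header. The header string and the fetch_vehicles helper are made up for illustration; the 'vehicles' key is the one used in the loop above.

```python
import requests

# Assumption: any reasonably current browser string will do; this one is made up.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def fetch_vehicles(url, get=requests.get):
    """Fetch the JSON endpoint with a User-Agent header and return its vehicle list.

    `get` defaults to requests.get and can be swapped for a stub when testing.
    """
    response = get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.json().get("vehicles", [])
```

Calling fetch_vehicles(url) with the URL above should then behave like the original snippet, only with the extra header and a timeout.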

This approach avoids the need to run JavaScript via Selenium and also avoids parsing any HTML with BeautifulSoup. The URL was found using the browser's network tools while the page was loading.
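If it helps to see which filters the long query string encodes, it can be decoded with the standard library. This is only a sketch: the query below is copied from the URL above, and nothing here is sent to the site.

```python
from urllib.parse import parse_qsl, urlencode

# Query string copied from the JSON URL above.
query = (
    "finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809"
    "&continent-country%5B%5D=France&distance%5B%5D=0&transmission%5B%5D=Manual"
    "&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760"
)

# Decode the percent-escapes into readable name/value pairs.
filters = dict(parse_qsl(query))
for name, value in filters.items():
    print(f"{name:22} : {value}")

# Re-encoding the dict gives the original query string back.
rebuilt = urlencode(filters)
```

Editing a value in filters before calling urlencode would give a query for a different search, although which values the site accepts is something you would have to check in the network tools.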

Upvotes: 2

Ahmed Soliman

Reputation: 1710

The CSS selector for matching a partial value of an attribute is as follows:

div[class*='list-item']  # *= matches any class attribute that contains this substring
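A minimal sketch of that selector in bs4, run against made-up markup that only loosely imitates the class names in the question (the real page is different, see the next paragraph). It also shows that find_all with class_ matches a single class name even when the element carries several classes.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; not copied from the real page.
html = """
<div class="list-item listing id-1"><span class="model_year">2011</span></div>
<div class="list-item listing id-2"><span class="model_year">2012</span></div>
<div class="sidebar"><span class="model_year">n/a</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Substring match on the whole class attribute.
by_substring = soup.select("div[class*='list-item']")

# bs4 treats class as multi-valued, so one class name is enough to match.
by_class = soup.find_all("div", class_="list-item")

years = [div.find("span", class_="model_year").text for div in by_substring]
print(years)
```

Note that the substring form would also match a class such as list-item-wide, whereas class_='list-item' only matches that exact class name.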

However, if you look at the page source, you will see that the content you are trying to scrape is generated by JavaScript, so you have three options:

  1. Use Selenium with a headless browser to render the JavaScript.
  2. Look for the Ajax calls and replicate them; the website loads this data through an Ajax call to a dedicated URL.
  3. Look for the data you are trying to scrape inside a script tag, as follows:

I prefer the third option in similar situations because you only need to parse JSON.

import json
import requests
from bs4 import BeautifulSoup
URL = 'https://preowned.astonmartin.com/preowned-cars/search/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&postcode-area=United%20Kingdom&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760'

page = requests.get(URL, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"})
soup = BeautifulSoup(page.text, 'html.parser')
json_obj = soup.find('script',{'type':"application/ld+json"}).text
#{"@context":"http://schema.org","@graph":[{"@type":"Brand","name":""},{"@type":"OfferCatalog","itemListElement":[{"@type":"Offer","name":"Pre-Owned By Aston Martin","price":"€114,900.00","url":"https://preowned.astonmartin.com/preowned-cars/12984-aston-martin-v12-vantage-v8-volante/","itemOffered":{"@type":"Car","name":"Aston Martin V12 Vantage V8 Volante","brand":"Aston Martin","model":"V12 Vantage","itemCondition":"Used","category":"Used","productionDate":"2010","releaseDate":"2011","bodyType":"6.0 Litre V12","emissionsCO2":"388","fuelType":"Obsidian Black","mileageFromOdometer":"42000","modelDate":"2011","seatingCapacity":"2","speed":"190","vehicleEngine":"6l","vehicleInteriorColor":"Obsidian Black","color":"Black"}},{"@type":"Offer","name":"Pre-Owned By Aston Martin","price":"€99,900.00","url":"https://preowned.astonmartin.com/preowned-cars/10359-aston-martin-v12-vantage-carbon-edition-coupe/","itemOffered":{"@type":"Car","name":"Aston Martin V12 Vantage Carbon Edition Coupe","brand":"Aston Martin","model":"V12 Vantage","itemCondition":"Used","category":"Used","productionDate":"2011","releaseDate":"2011","bodyType":"6.0 Litre V12","emissionsCO2":"388","fuelType":"Obsidian Black","mileageFromOdometer":"42000","modelDate":"2011","seatingCapacity":"2","speed":"190","vehicleEngine":"6l","vehicleInteriorColor":"Obsidian Black","color":"Black"}}]},{"@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":"1","item":{"@id":"https://preowned.astonmartin.com/","name":"Homepage"}},{"@type":"ListItem","position":"2","item":{"@id":"https://preowned.astonmartin.com/preowned-cars/","name":"Pre-Owned Cars"}},{"@type":"ListItem","position":"3","item":{"@id":"//preowned.astonmartin.com/preowned-cars/search/","name":"Pre-Owned By Aston Martin"}}]}]}
items = json.loads(json_obj)['@graph'][1]['itemListElement']
for item in items:
    print(item['itemOffered']['name'])

Output:

Aston Martin V12 Vantage V8 Volante
Aston Martin V12 Vantage Carbon Edition Coupe
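The same items carry more than the name. Here is a small sketch that pulls the model year and price as well, using a trimmed copy of the JSON from the comment above (most fields left out). It looks the catalogue up by its @type instead of the fixed index [1], on the assumption that the order of @graph might change; find_offers is a helper name made up for this example.

```python
import json

# Trimmed copy of the JSON-LD from the comment above; most fields are omitted.
raw = """
{"@context": "http://schema.org", "@graph": [
  {"@type": "Brand", "name": ""},
  {"@type": "OfferCatalog", "itemListElement": [
    {"@type": "Offer", "price": "€114,900.00",
     "itemOffered": {"@type": "Car", "name": "Aston Martin V12 Vantage V8 Volante",
                     "modelDate": "2011"}},
    {"@type": "Offer", "price": "€99,900.00",
     "itemOffered": {"@type": "Car", "name": "Aston Martin V12 Vantage Carbon Edition Coupe",
                     "modelDate": "2011"}}
  ]}
]}
"""


def find_offers(graph):
    """Return the offers of the first OfferCatalog node, or an empty list."""
    for node in graph:
        if node.get("@type") == "OfferCatalog":
            return node.get("itemListElement", [])
    return []


offers = find_offers(json.loads(raw)["@graph"])
rows = [(o["itemOffered"]["name"], o["itemOffered"]["modelDate"], o["price"]) for o in offers]
for name, year, price in rows:
    print(f"{year} {name}: {price}")
```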

Upvotes: 3
