Reputation: 393
I am trying to get the name of a car model as it appears on the website, but for some reason (after trying all of the following) it doesn't seem to work.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.carsales.com.au/cars/results?offset=12"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
model_name = soup.find_all('a', attrs={'data-webm-clickvalue':'sv-view-title'})
final_model_name = model_name[1]
clean_model_name = final_model_name.text
clean_model_name = clean_model_name.replace("\r", "")
clean_model_name = clean_model_name.replace("\n", "")
clean_model_name = clean_model_name.strip()
clean_model_name = clean_model_name.rstrip()
print(clean_model_name)
I have also created a variable containing the whole sentence I want to remove and passed it to the strip function, which works, but the MY14 part of it changes based on the year of the car. Creating a variable for each year doesn't seem very efficient.
Some indexes return clean results; others return the following:
2014 Holden Cruze SRi Z Series JH Series II Auto MY14 Manufacturer Marketing Year (MY) The manufacturer's marketing year of this model.
I don't need any of the details after the car model. From what I have read, strip() should remove white space on either side (but in this case it doesn't) and rstrip() should remove everything to the right (but in this case it doesn't).
I have successfully created a for loop which loops through each of the cars on this page, but some rows in the DataFrame are extended due to the additional unwanted text.
Upvotes: 1
Views: 181
Reputation: 628
strip() only removes the white space characters at the start and end of the string you are working with; it does not touch anything in the middle. You can try this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.carsales.com.au/cars/results?offset=12"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
model_name = soup.find_all('a', attrs={'data-webm-clickvalue':'sv-view-title'})
final_model_name = model_name[1]
clean_model_name = final_model_name.text
clean_model_name = clean_model_name.strip().split()[:5]
clean_model_name = ' '.join(clean_model_name)
print(clean_model_name)
I noticed that most of the model names have 5 key parts (the year, the brand and the model details), so I used [:5] to keep the first five words of the model name. If you also want to drop the series part, just change the value to 3. Note that it is split() that breaks the model name into words on white space; strip() only trims the ends. Hope this helps.
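If the number of words in the title varies, a fixed slice like [:5] will sometimes cut too much or too little. A possible alternative is to cut the text where the tooltip starts instead. The sketch below assumes the unwanted text always begins with an optional "MYxx" token followed by the phrase "Manufacturer Marketing Year", as in the sample from the question; the pattern and the helper name clean_title are my own and have not been checked against the live page.

```python
import re

# Sample text copied from the question, with some surrounding white space
# added; the text on the live page may differ.
raw = ("\r\n  2014 Holden Cruze SRi Z Series JH Series II Auto MY14 "
       "Manufacturer Marketing Year (MY) The manufacturer's marketing "
       "year of this model.\r\n")

# Assumption: the tooltip starts with an optional "MY..." token followed by
# the phrase "Manufacturer Marketing Year". Everything from there on is dropped.
TOOLTIP = re.compile(r"\s*(?:MY\S*\s+)?Manufacturer Marketing Year\b.*", re.DOTALL)

def clean_title(text):
    """Remove the tooltip text, then collapse runs of white space."""
    text = TOOLTIP.sub("", text)
    # split() with no arguments splits on any white space (including \r and \n)
    # and discards empty pieces, so no separate replace() or strip() is needed.
    return " ".join(text.split())

print(clean_title(raw))
```

Titles that do not contain the tooltip phrase pass through unchanged apart from the white space clean-up, so the same function can be applied to every row in the loop.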
Upvotes: 1