AdrianC

Reputation: 393

How do I remove extra text to the right of a string?

I am trying to get the name of a car model as it appears on the website, but none of the following attempts work.

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.carsales.com.au/cars/results?offset=12"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
model_name = soup.find_all('a', attrs={'data-webm-clickvalue':'sv-view-title'})
final_model_name = model_name[1]
clean_model_name = final_model_name.text
clean_model_name = clean_model_name.replace("\r", "")
clean_model_name = clean_model_name.replace("\n", "")
clean_model_name = clean_model_name.strip()
clean_model_name = clean_model_name.rstrip()
print(clean_model_name)

I have also created a variable that contains the whole sentence I want to remove and passed it to the strip function. That works, but the MY14 part of it changes based on the year of the car, and creating a variable for each year doesn't seem very efficient.

Some indexes return clean results, however, others return the following (scroll across):

2014 Holden Cruze SRi Z Series JH Series II Auto                                                     MY14                        Manufacturer Marketing Year (MY)                            The manufacturer's marketing year of this model.

I don't need any of the details after the car model. From what I have read, strip() should remove the whitespace on either side and rstrip() should remove everything to the right, but neither does so here.

I have successfully created a for loop which loops through each of the cars on this page, but some rows in the DataFrame are extended due to the additional unwanted text.

Upvotes: 1

Views: 181

Answers (1)

chngzm

Reputation: 628

strip() only removes whitespace characters at the start and end of the string, so the whitespace and extra text in the middle are left untouched. You can try this:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.carsales.com.au/cars/results?offset=12"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
model_name = soup.find_all('a', attrs={'data-webm-clickvalue':'sv-view-title'})
final_model_name = model_name[1]
clean_model_name = final_model_name.text
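# Trim the ends, split on whitespace, and keep the first five words
clean_model_name = clean_model_name.strip().split()[:5]
# Re-join the kept words with single spaces
clean_model_name = ' '.join(clean_model_name)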
print(clean_model_name)

I noticed that most of the model names have 5 key parts (the year, brand and the model), so I used [:5] to keep the first five words. If you only want the year, brand and model without the series part, change the value to 3. split() with no arguments breaks the text on any run of whitespace, and ' '.join() puts the kept words back together with single spaces. Hope this helps.
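
If the number of words in the title varies between listings, a fixed slice can keep too much or too little. A possible alternative, sketched below, assumes the unwanted text is always separated from the title by a run of two or more whitespace characters, as it is in the sample output in the question. The helper name extract_title is hypothetical, and the sample string is a shortened copy of the one in the question:

```python
import re


def extract_title(raw_text: str) -> str:
    """Return the text before the first run of 2+ whitespace characters.

    Assumption: the title itself only ever contains single spaces, and the
    trailing details (e.g. "MY14 ...") are separated by a wider gap.
    """
    # Remove leading/trailing whitespace, including \r and \n
    stripped = raw_text.strip()
    # Split once at the first wide gap and keep the left-hand part
    return re.split(r"\s{2,}", stripped, maxsplit=1)[0]


sample = (
    "2014 Holden Cruze SRi Z Series JH Series II Auto"
    "          MY14          Manufacturer Marketing Year (MY)"
)
print(extract_title(sample))
```

A title with no trailing details is returned unchanged apart from the stripped ends, so the same helper could be applied to every listing in the loop. Because \s also matches \r and \n, the separate replace() calls should not be needed under this assumption.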

Upvotes: 1
