Reputation: 724
Hi, I am new to web scraping and would like to scrape a website with BeautifulSoup. Now I'm wondering how to write efficient code. The site is a bike website that lists several bikes, and for each bike it shows the features price, state, distance and duration. All of these features have the same class, "product-feat". What is the most efficient way to get them into a pandas DataFrame? I'm asking in particular because all features share the same class, and looping seems inefficient to me.
<div class="product-smalltext">
<p class="product-feat">
<span>GBP 6'900 <small>(changeable)</small></span><br/> price
</p>
<p class="product-feat">
<span>new</span><br/> state
</p>
<p class="product-feat">
<span>10'000 km</span><br/> distance
</p>
<p class="product-feat">
<span>48 months</span><br/> duration
</p>
I did it with a dictionary (which I will convert to a DataFrame afterwards) and a double loop, see below. But is this really the most efficient way to do it, with loops and append/extend? Or is there a better way?
# empty dictionary
content = {'price': [],
           'state': [],
           'distance': [],
           'duration': []
           }
listing_content = soup.find_all("div", {"class": "product-smalltext"})
# double loop through all bikes and features
for smalltext in listing_content:
    for feat in smalltext.find_all("p", {"class": "product-feat"}):
        content['price'].extend(re.findall(r'(?s)^.*?(?=\s*price)', feat.get_text().strip().replace('\xa0', ' ')))
        content['state'].extend(re.findall(r'(?s)^.*?(?=\s*state)', feat.getText().strip()))
        content['distance'].extend(re.findall(r'(?s)^.*?(?=\s*distance)', feat.getText().strip()))
        content['duration'].extend(re.findall(r'(?s)^.*?(?=\s*duration)', feat.getText().strip()))
Upvotes: 0
Views: 146
Reputation: 9941
You can use a list comprehension and construct the DataFrame directly:
pd.DataFrame([[y.text
               for y in x.select('p.product-feat > span')]
              for x in soup.select('div[class="product-smalltext"]')],
             columns=['price', 'state', 'distance', 'duration'])
Output (I've added one more <div> to make sure it works with multiple items per page):
                     price  state   distance   duration
0   GBP 6'900 (changeable)    new  10'000 km  48 months
1  GBP 16'900 (changeable)    old  20'000 km  48 months
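To try this without a live page, here is a minimal self-contained sketch. The first `<div>` is the markup from the question; the second `<div>` is made up to match the second row of the output above, and the `html` string and the choice of `'html.parser'` are assumptions for illustration only.

```python
import bs4
import pandas as pd

# Sample markup: first <div> from the question, second <div> invented
# to mirror the second row of the output shown above.
html = """
<div class="product-smalltext">
  <p class="product-feat"><span>GBP 6'900 <small>(changeable)</small></span><br/> price</p>
  <p class="product-feat"><span>new</span><br/> state</p>
  <p class="product-feat"><span>10'000 km</span><br/> distance</p>
  <p class="product-feat"><span>48 months</span><br/> duration</p>
</div>
<div class="product-smalltext">
  <p class="product-feat"><span>GBP 16'900 <small>(changeable)</small></span><br/> price</p>
  <p class="product-feat"><span>old</span><br/> state</p>
  <p class="product-feat"><span>20'000 km</span><br/> distance</p>
  <p class="product-feat"><span>48 months</span><br/> duration</p>
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# One inner list per product <div>, one entry per feature <span>.
rows = [[span.text for span in div.select('p.product-feat > span')]
        for div in soup.select('div[class="product-smalltext"]')]

df = pd.DataFrame(rows, columns=['price', 'state', 'distance', 'duration'])
print(df)
```

Note that this version assigns columns by position, so it relies on every product listing the same four features in the same order.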
Update: We can also extract the names of the product features from the text and use them as column names. If some values are missing, we'll get NaNs in the dataframe:
pd.DataFrame([{y.find('br').next_sibling.strip(): y.find('span').text
               for y in x.select('p.product-feat')}
              for x in soup.select('div[class="product-smalltext"]')])
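As a small illustration of the missing-value behaviour, the sketch below applies the same dictionary comprehension to two made-up products, the second of which has no "state" feature. The sample `html` and the `'html.parser'` argument are assumptions, not taken from the real site.

```python
import bs4
import pandas as pd

# Hypothetical markup: the second product lacks the "state" feature.
html = """
<div class="product-smalltext">
  <p class="product-feat"><span>GBP 6'900</span><br/> price</p>
  <p class="product-feat"><span>new</span><br/> state</p>
</div>
<div class="product-smalltext">
  <p class="product-feat"><span>GBP 16'900</span><br/> price</p>
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# One dict per product: the text after <br/> is the key, the <span> text the value.
records = [{p.find('br').next_sibling.strip(): p.find('span').text
            for p in div.select('p.product-feat')}
           for div in soup.select('div[class="product-smalltext"]')]

# pandas aligns the dicts by key and fills absent keys with NaN.
df = pd.DataFrame(records)
print(df)
```

Because the rows are keyed by feature name rather than by position, a product with fewer features does not shift values into the wrong column.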
Update 2: With the specific website:
import bs4
import pandas as pd
import requests

resp = requests.get('https://www.leasingmarkt.ch/listing')
# name the parser explicitly so bs4 does not have to guess one
soup = bs4.BeautifulSoup(resp.text, 'html.parser')
df = pd.DataFrame([{y.find('br').next_sibling.strip(): y.find('span').text
                    for y in x.select('p.product-feat')}
                   for x in soup.select('div[class="product-smalltext"]')])
df
Output:
                Anzahlung  Erstzulassung  Jährliche Fahrleistung  \
0       CHF 0 (anpassbar)       Neuwagen   10'000 km (anpassbar)
1  CHF 13'250 (anpassbar)       Neuwagen   10'000 km (anpassbar)
2   CHF 2'000 (anpassbar)       Neuwagen               10'000 km
3  CHF 10'000 (anpassbar)       Neuwagen   15'000 km (anpassbar)
4   CHF 1'500 (anpassbar)       Neuwagen               10'000 km
5   CHF 3'500 (anpassbar)       Neuwagen               10'000 km
6   CHF 6'900 (anpassbar)       Neuwagen               10'000 km
7  CHF 10'500 (anpassbar)       Neuwagen   10'000 km (anpassbar)
8       CHF 0 (anpassbar)        08/2015               10'000 km
9       CHF 0 (anpassbar)       Neuwagen   10'000 km (anpassbar)

                Laufzeit  Kilometerstand                  Leistung
0  48 Monate (anpassbar)            0 km  204 PS (150 kW), Elektro
1  48 Monate (anpassbar)            0 km   245 PS (180 kW), Benzin
2              48 Monate            0 km   190 PS (140 kW), Benzin
3  48 Monate (anpassbar)            0 km   150 PS (110 kW), Benzin
4              48 Monate            0 km  204 PS (150 kW), Elektro
5              48 Monate            0 km   320 PS (235 kW), Benzin
6              48 Monate            0 km    129 PS (95 kW), Benzin
7  48 Monate (anpassbar)            0 km   184 PS (135 kW), Benzin
8              48 Monate       46'000 km   190 PS (140 kW), Diesel
9  48 Monate (anpassbar)            0 km   320 PS (235 kW), Benzin
Update 3: And here it is with selenium:
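import bs4
import pandas as pd
from selenium import webdriver

url = 'https://www.leasingmarkt.ch/listing'
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
driver.quit()  # close the browser once the page source has been captured
# name the parser explicitly so bs4 does not have to guess one
soup = bs4.BeautifulSoup(html, 'html.parser')
df = pd.DataFrame([{y.find('br').next_sibling.strip(): y.find('span').text
                    for y in x.select('p.product-feat')}
                   for x in soup.select('div[class="product-smalltext"]')])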
Upvotes: 2