corianne1234

Reputation: 724

Efficient web scraping with Python

Hi, I am new to web scraping and would like to scrape a website with BeautifulSoup. I'm wondering how to write efficient code. The site lists bikes, and each bike has the features price, state, distance and duration. All of these share the same class, "product-feat". What is the most efficient way to get all these features into a pandas DataFrame? I'm asking in particular because all the features have the same class, and looping seems inefficient to me.

<div class="product-smalltext">
    <p class="product-feat">
        <span>GBP 6'900 <small>(changeable)</small></span><br/> price
    </p>
    <p class="product-feat">
        <span>new</span><br/> state
    </p>
    <p class="product-feat">
        <span>10'000 km</span><br/> distance
    </p>
    <p class="product-feat">
        <span>48 months</span><br/> duration
    </p>
</div>

I did it with a dictionary (which I will convert to a DataFrame afterwards) and a double loop, see below. But is this really the most efficient way to do it, with loops and append/extend? Or is there a better way?

import re

# Empty dictionary with one list per feature
content = {'price': [],
           'state': [],
           'distance': [],
           'duration': []}

# `soup` is the already-parsed BeautifulSoup document
listing_content = soup.find_all("div", {"class": "product-smalltext"})
# Double loop through all bikes and their features
for smalltext in listing_content:
    for feat in smalltext.find_all("p", {"class": "product-feat"}):
        # Each regex keeps the text in front of its feature label, if present
        text = feat.get_text().strip()
        content['price'].extend(re.findall(r'(?s)^.*?(?=\s*price)', text.replace('\xa0', ' ')))
        content['state'].extend(re.findall(r'(?s)^.*?(?=\s*state)', text))
        content['distance'].extend(re.findall(r'(?s)^.*?(?=\s*distance)', text))
        content['duration'].extend(re.findall(r'(?s)^.*?(?=\s*duration)', text))

Upvotes: 0

Views: 146

Answers (1)

perl

Reputation: 9941

You can use a list comprehension and construct the DataFrame directly:

import pandas as pd

# One row per listing <div>, one cell per feature <span>
pd.DataFrame([[y.text
    for y in x.select('p.product-feat > span')]
    for x in soup.select('div[class="product-smalltext"]')],
    columns=['price', 'state', 'distance', 'duration'])

Output (I've added one more <div> to make sure it works with multiple items per page):

                     price state   distance   duration
0   GBP 6'900 (changeable)   new  10'000 km  48 months
1  GBP 16'900 (changeable)   old  20'000 km  48 months
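
For anyone who wants to reproduce this without hitting a live site, here is a minimal self-contained sketch. The first <div> is the markup from the question; the second one is my reconstruction of the extra listing from the output above, so treat its exact markup as an assumption. I also pass "html.parser" explicitly, which is just one possible parser choice.

```python
import bs4
import pandas as pd

# Sample markup: the first listing comes from the question, the second is
# an assumed extra listing rebuilt from the values in the output table.
html = """
<div class="product-smalltext">
  <p class="product-feat"><span>GBP 6'900 <small>(changeable)</small></span><br/> price</p>
  <p class="product-feat"><span>new</span><br/> state</p>
  <p class="product-feat"><span>10'000 km</span><br/> distance</p>
  <p class="product-feat"><span>48 months</span><br/> duration</p>
</div>
<div class="product-smalltext">
  <p class="product-feat"><span>GBP 16'900 <small>(changeable)</small></span><br/> price</p>
  <p class="product-feat"><span>old</span><br/> state</p>
  <p class="product-feat"><span>20'000 km</span><br/> distance</p>
  <p class="product-feat"><span>48 months</span><br/> duration</p>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Outer comprehension: one row per listing; inner: one value per feature <span>
rows = [[y.text for y in x.select('p.product-feat > span')]
        for x in soup.select('div[class="product-smalltext"]')]

df = pd.DataFrame(rows, columns=['price', 'state', 'distance', 'duration'])
print(df)
```

Note that this version relies on every listing having exactly four features in the same order, since the column names are assigned by position.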

Update: We can also extract the names of the product features from the text and use them as column names. If some values are missing for a listing, we'll get NaNs in the corresponding cells of the DataFrame:

pd.DataFrame([{y.find('br').next_sibling.strip(): y.find('span').text
    for y in x.select('p.product-feat')}
    for x in soup.select('div[class="product-smalltext"]')])
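
To illustrate the NaN behaviour, here is a small sketch with made-up markup in which each listing lacks some features. The HTML is hypothetical (shortened from the question's structure), and "html.parser" is again just my parser choice.

```python
import bs4
import pandas as pd

# Hypothetical markup: the first listing has no distance,
# the second has no state.
html = """
<div class="product-smalltext">
  <p class="product-feat"><span>GBP 6'900</span><br/> price</p>
  <p class="product-feat"><span>new</span><br/> state</p>
</div>
<div class="product-smalltext">
  <p class="product-feat"><span>GBP 16'900</span><br/> price</p>
  <p class="product-feat"><span>20'000 km</span><br/> distance</p>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# One dict per listing: the text after <br/> is the key, the <span> text the value
rows = [{y.find('br').next_sibling.strip(): y.find('span').text
         for y in x.select('p.product-feat')}
        for x in soup.select('div[class="product-smalltext"]')]

# pandas aligns the dicts by key and fills the gaps with NaN
df = pd.DataFrame(rows)
print(df)
```

Because the rows are matched by label rather than by position, this variant should also cope with features appearing in a different order from one listing to the next.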

Update 2: With the specific website:

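import bs4
import pandas as pd
import requests

resp = requests.get('https://www.leasingmarkt.ch/listing')
# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = bs4.BeautifulSoup(resp.text, 'html.parser')

# One dict per listing: feature label -> feature value
df = pd.DataFrame([{y.find('br').next_sibling.strip(): y.find('span').text
    for y in x.select('p.product-feat')}
    for x in soup.select('div[class="product-smalltext"]')])

df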

Output:


                Anzahlung Erstzulassung Jährliche Fahrleistung  \
0       CHF 0 (anpassbar)      Neuwagen  10'000 km (anpassbar)   
1  CHF 13'250 (anpassbar)      Neuwagen  10'000 km (anpassbar)   
2   CHF 2'000 (anpassbar)      Neuwagen              10'000 km   
3  CHF 10'000 (anpassbar)      Neuwagen  15'000 km (anpassbar)   
4   CHF 1'500 (anpassbar)      Neuwagen              10'000 km   
5   CHF 3'500 (anpassbar)      Neuwagen              10'000 km   
6   CHF 6'900 (anpassbar)      Neuwagen              10'000 km   
7  CHF 10'500 (anpassbar)      Neuwagen  10'000 km (anpassbar)   
8       CHF 0 (anpassbar)       08/2015              10'000 km   
9       CHF 0 (anpassbar)      Neuwagen  10'000 km (anpassbar)   

                Laufzeit Kilometerstand                  Leistung  
0  48 Monate (anpassbar)           0 km  204 PS (150 kW), Elektro  
1  48 Monate (anpassbar)           0 km   245 PS (180 kW), Benzin  
2              48 Monate           0 km   190 PS (140 kW), Benzin  
3  48 Monate (anpassbar)           0 km   150 PS (110 kW), Benzin  
4              48 Monate           0 km  204 PS (150 kW), Elektro  
5              48 Monate           0 km   320 PS (235 kW), Benzin  
6              48 Monate           0 km    129 PS (95 kW), Benzin  
7  48 Monate (anpassbar)           0 km   184 PS (135 kW), Benzin  
8              48 Monate      46'000 km   190 PS (140 kW), Diesel  
9  48 Monate (anpassbar)           0 km   320 PS (235 kW), Benzin  

Update 3: And here it is with Selenium:

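import bs4
import pandas as pd
from selenium import webdriver

url = 'https://www.leasingmarkt.ch/listing'
driver = webdriver.Chrome()
driver.get(url)

# Grab the rendered page, then close the browser
html = driver.page_source
driver.quit()
soup = bs4.BeautifulSoup(html, 'html.parser')

# One dict per listing: feature label -> feature value
df = pd.DataFrame([{y.find('br').next_sibling.strip(): y.find('span').text
    for y in x.select('p.product-feat')}
    for x in soup.select('div[class="product-smalltext"]')])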

Upvotes: 2
