nick12

Reputation: 3

how to loop using beautifulsoup

I am trying to scrape data on car model, price, mileage, location, etc using beautifulsoup. However, the return result only reports data on one random car. I want to be able to collect data on all cars advertised on the site to date. My python code is below. How can I modify my code to retrieve data such that each day I have information on car model, price, mileage, location, etc? Example:

  Car model        price   mileage  location  date
  Toyota Corrola   $4500   22km     Accra     16/02/2018
  Nissan Almera    $9500   60km     Tema      16/02/2018

etc

import requests
from bs4 import BeautifulSoup
import pandas
import csv
from datetime import datetime
for i in range(300):       
    url = "https://tonaton.com/en/ads/ghana/cars?".format(i) 
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print soup.prettify()
data = soup.find(class_='item-content')

for tag in data:    
    item_title = data.find("a",attrs={"class":"item-title h4"})
    model = item_title.text.encode('utf-8').strip()
    item_meta = data.find("p",attrs={"class":"item-meta"})
    mileage = item_meta.text.encode('utf-8').strip()
    item_location = data.find("p",attrs={"class":"item-location"})
    location = item_location.text.encode('utf-8').strip()
    item_info = data.find("p",attrs={"class":"item-info"})
    price = item_info.text.encode('utf-8').strip()           
with open('example.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([model, price, mileage, location, datetime.now()])

Upvotes: 0

Views: 267

Answers (2)

David Owens

Reputation: 675

First off, this loop:

for i in range(300):       
    url = "https://tonaton.com/en/ads/ghana/cars?".format(i)

is not doing what I assume you think it is. Since the string contains no `{}` placeholder, `.format(i)` has no effect at all; the loop just assigns the same url 300 times and leaves you with the original url you set. You need to wrap all of your scraping code inside this loop to ensure you are hitting each of the URLs you want (0-299).
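You can see the no-op for yourself: without a `{}` placeholder in the string, `.format()` returns the string unchanged.

```python
url = "https://tonaton.com/en/ads/ghana/cars?".format(299)
print(url)  # .format(299) is silently ignored: the string has no {} placeholder
# https://tonaton.com/en/ads/ghana/cars?

url = "https://tonaton.com/en/ads/ghana/cars?{}".format(299)
print(url)  # with a placeholder, the page number is actually substituted
# https://tonaton.com/en/ads/ghana/cars?299
```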

Restructure your code (paying attention to indents!) so that the next url is the one being used in the request:

# This will print ALOT of titles
for i in range(300):
    url = "https://tonaton.com/en/ads/ghana/cars?" + str(i) 
    print(url) # Notice how the url changes with each iteration?
    r = requests.get(url)
    soup = bsoup(r.content, "html.parser")
    titles = soup.findAll("a",attrs={"class":"item-title h4"})
    for item in titles:
        currTitle = item.text.encode('utf-8').strip()
        print(currTitle)

This code:

import requests
from bs4 import BeautifulSoup as bsoup

url = "https://tonaton.com/en/ads/ghana/cars?1"
r = requests.get(url)
soup = bsoup(r.content, "html.parser")
titles = soup.findAll("a",attrs={"class":"item-title h4"})
for item in titles:
    print(item.text.encode('utf-8').strip())

Yields (the `b''` prefix means the values are `bytes` objects, produced by the `.encode('utf-8')` call):

b'Hyundai Veloster 2013'
b'Ford Edge 2009'
b'Mercedes-Benz C300 2016'
b'Mazda Demio 2007'
b'Hyundai Santa fe 2005'
# And so on...
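If the `b''` prefix is unwanted: in Python 3, `.text` is already a `str`, so you can simply drop the `.encode('utf-8')` step when printing, or decode the bytes back to a string.

```python
title = "Hyundai Veloster 2013"

encoded = title.encode('utf-8')   # str -> bytes; printing shows the b'' prefix
print(encoded)                    # b'Hyundai Veloster 2013'

print(encoded.decode('utf-8'))    # bytes -> str again
print(title.strip())              # or just skip .encode() entirely in Python 3
```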

The problem is twofold: 1) find() stops after the first match given your params, so you only ever get one car. findAll() dumps all matches into a list, which you can then iterate through and process as needed. 2) In your original code, `data = soup.find(class_='item-content')` grabs only the first matching element, and `for tag in data:` just walks that single element's children, so every subsequent find() call sees the same one car.
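A minimal illustration of the difference, using a small inline HTML snippet as a stand-in for the real page:

```python
from bs4 import BeautifulSoup

html = """
<div class="item-content"><a class="item-title h4">Toyota Corrola</a></div>
<div class="item-content"><a class="item-title h4">Nissan Almera</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match
first = soup.find("a", attrs={"class": "item-title h4"})
print(first.text)  # Toyota Corrola

# findAll() returns every match as a list you can iterate over
titles = soup.findAll("a", attrs={"class": "item-title h4"})
print([t.text for t in titles])  # ['Toyota Corrola', 'Nissan Almera']
```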

Upvotes: 0

nick12

Reputation: 3

import requests
from bs4 import BeautifulSoup as bsoup
import csv
from datetime import datetime
for i in range(300):
    url = "https://tonaton.com/en/ads/ghana/cars?" + str(i)
    r = requests.get(url)
    soup = bsoup(r.content, "html.parser")

    item_title = soup.findAll("a", attrs={"class": "item-title h4"})
    item_meta = soup.findAll("p", attrs={"class": "item-meta"})
    item_location = soup.findAll("p", attrs={"class": "item-location"})
    item_info = soup.findAll("p", attrs={"class": "item-info"})

    # open in append mode so each page adds rows instead of overwriting the file
    with open('index.csv', 'a') as csv_file:
        writer = csv.writer(csv_file)
        # zip the four lists together so each row describes one car
        for title, meta, loc, info in zip(item_title, item_meta, item_location, item_info):
            model = title.text.strip()
            mileage = meta.text.strip()
            location = loc.text.strip()
            price = info.text.strip()
            writer.writerow([model, price, mileage, location, datetime.now()])

Upvotes: 0
