nick12

Reputation: 3

how to loop using beautifulsoup

I am trying to scrape data on car model, price, mileage, location, etc using beautifulsoup. However, the return result only reports data on one random car. I want to be able to collect data on all cars advertised on the site to date. My python code is below. How can I modify my code to retrieve data such that each day I have information on car model, price, mileage, location, etc? Example:

  Car model        price   mileage  location  date
  Toyota Corrola   $4500   22km     Accra     16/02/2018
  Nissan Almera    $9500   60km     Tema      16/02/2018

etc

import requests
from bs4 import BeautifulSoup
import pandas
import csv
from datetime import datetime
for i in range(300):       
    url = "https://tonaton.com/en/ads/ghana/cars?".format(i) 
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
print soup.prettify()
data = soup.find(class_='item-content')

for tag in data:    
    item_title = data.find("a",attrs={"class":"item-title h4"})
    model = item_title.text.encode('utf-8').strip()
    item_meta = data.find("p",attrs={"class":"item-meta"})
    mileage = item_meta.text.encode('utf-8').strip()
    item_location = data.find("p",attrs={"class":"item-location"})
    location = item_location.text.encode('utf-8').strip()
    item_info = data.find("p",attrs={"class":"item-info"})
    price = item_info.text.encode('utf-8').strip()           
with open('example.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([model, price, mileage, location, datetime.now()])

Upvotes: 0

Views: 267

Answers (2)

David Owens

Reputation: 675

First off, this loop:

for i in range(300):       
    url = "https://tonaton.com/en/ads/ghana/cars?".format(i)

is not doing what I assume you think it is. Since the string contains no `{}` placeholder, `.format(i)` has no effect at all; the loop just assigns the same url 300 times and leaves you with the original url you set. You need to wrap all of your scraping code inside this loop to ensure you are hitting each of the URLs you want (0-299).
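You can see the no-op for yourself: without a `{}` placeholder in the string, `.format()` returns the string unchanged.

```python
url = "https://tonaton.com/en/ads/ghana/cars?".format(299)
print(url)  # .format(299) is silently ignored: the string has no {} placeholder
# https://tonaton.com/en/ads/ghana/cars?

url = "https://tonaton.com/en/ads/ghana/cars?{}".format(299)
print(url)  # with a placeholder, the page number is actually substituted
# https://tonaton.com/en/ads/ghana/cars?299
```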

Restructure your code (paying attention to indents!) so that the next url is the one being used in the request:

# This will print ALOT of titles
for i in range(300):
    url = "https://tonaton.com/en/ads/ghana/cars?" + str(i) 
    print(url) # Notice how the url changes with each iteration?
    r = requests.get(url)
    soup = bsoup(r.content, "html.parser")
    titles = soup.findAll("a",attrs={"class":"item-title h4"})
    for item in titles:
        currTitle = item.text.encode('utf-8').strip()
        print(currTitle)

This code:

import requests
from bs4 import BeautifulSoup as bsoup

url = "https://tonaton.com/en/ads/ghana/cars?1"
r = requests.get(url)
soup = bsoup(r.content, "html.parser")
titles = soup.findAll("a",attrs={"class":"item-title h4"})
for item in titles:
    print(item.text.encode('utf-8').strip())

Yields (the `b''` prefix means the values are `bytes` objects, produced by the `.encode('utf-8')` call):

b'Hyundai Veloster 2013'
b'Ford Edge 2009'
b'Mercedes-Benz C300 2016'
b'Mazda Demio 2007'
b'Hyundai Santa fe 2005'
# And so on...
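If the `b''` prefix is unwanted: in Python 3, `.text` is already a `str`, so you can simply drop the `.encode('utf-8')` step when printing, or decode the bytes back to a string.

```python
title = "Hyundai Veloster 2013"

encoded = title.encode('utf-8')   # str -> bytes; printing shows the b'' prefix
print(encoded)                    # b'Hyundai Veloster 2013'

print(encoded.decode('utf-8'))    # bytes -> str again
print(title.strip())              # or just skip .encode() entirely in Python 3
```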

The problem is twofold: 1) find() stops after the first match given your params, so you only ever get one car. findAll() dumps all matches into a list, which you can then iterate through and process as needed. 2) In your original code, `data = soup.find(class_='item-content')` grabs only the first matching element, and `for tag in data:` just walks that single element's children, so every subsequent find() call sees the same one car.
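A minimal illustration of the difference, using a small inline HTML snippet as a stand-in for the real page:

```python
from bs4 import BeautifulSoup

html = """
<div class="item-content"><a class="item-title h4">Toyota Corrola</a></div>
<div class="item-content"><a class="item-title h4">Nissan Almera</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match
first = soup.find("a", attrs={"class": "item-title h4"})
print(first.text)  # Toyota Corrola

# findAll() returns every match as a list you can iterate over
titles = soup.findAll("a", attrs={"class": "item-title h4"})
print([t.text for t in titles])  # ['Toyota Corrola', 'Nissan Almera']
```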

Upvotes: 0

nick12

Reputation: 3

import requests
from bs4 import BeautifulSoup as bsoup
import csv
from datetime import datetime
for i in range(300):
    url = "https://tonaton.com/en/ads/ghana/cars?" + str(i)
    r = requests.get(url)
    soup = bsoup(r.content, "html.parser")

    item_title = soup.findAll("a", attrs={"class": "item-title h4"})
    item_meta = soup.findAll("p", attrs={"class": "item-meta"})
    item_location = soup.findAll("p", attrs={"class": "item-location"})
    item_info = soup.findAll("p", attrs={"class": "item-info"})

    # open in append mode so each page adds rows instead of overwriting the file
    with open('index.csv', 'a') as csv_file:
        writer = csv.writer(csv_file)
        # zip the four lists together so each row describes one car
        for title, meta, loc, info in zip(item_title, item_meta, item_location, item_info):
            model = title.text.strip()
            mileage = meta.text.strip()
            location = loc.text.strip()
            price = info.text.strip()
            writer.writerow([model, price, mileage, location, datetime.now()])

Upvotes: 0
