pvmlad

Reputation: 85

How to code a for loop in Python for a web scraper

I'm writing a web scraping bot for AutoTrader, a popular car trading site in the UK. I'm trying to do as much as I can on my own, but I'm stuck on how to get my script to do what I want.

Basically, I want the bot to download certain information from the first 100 pages of listings for every car make and model within a particular radius of my home. I also want the bot to stop requesting further pages for a particular make/model once there are no more new listings.

For instance, if there are only 4 pages of listings and I ask for page 5, the URL automatically changes back to page 1 and the bot downloads all the listings on page 1 again; it then repeats this for every page up to page 100. Obviously I don't want 96 repeats of the cars on page 1 in my data set, so I'd like to move on to the next model of car when this happens, but I haven't figured out a way to do that yet.

Here's what I have got so far:

for x in range(1, 101):
    makes = ["ABARTH", "AC", "AIXAM", "ARIEL", "ASTON%20MARTIN", "AUDI"]
    for make in makes:
        my_url_page_x_make_i = 'https://www.autotrader.co.uk/car-search?' + 'sort=distance' + '&postcode=BS247EY' + '&radius=300' + '&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New' + '&make=' + make + '&page=' + str(x)
        uClient = uReq(my_url_page_x_make_i)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        listings = page_soup.findAll("li", {"class": "search-page__result"})
        for listing in listings:
            information_container = listing.find("div", {"class": "information-container"})
            title_container = information_container.find("a", {
                "class": "js-click-handler listing-fpa-link tracking-standard-link"})
            title = title_container.text
            price = listing.find("div", {"class": "vehicle-price"}).text

            print("title: " + title)
            print("price: " + price)

            f.write(title.replace(",", "") + "," + price.replace(",", "") + "\n")
            if len(listings) < 13: makes.remove(make)

f.close()

This is far from a finished script and I only have about 1 week of real Python coding experience.

Upvotes: 2

Views: 1063

Answers (1)

Fr3ddyDev

Reputation: 474

I think I've solved your problem, but I'd suggest you invert your loops: loop over the makes before you loop over the pages. Keeping your original structure, I solved the problem by scraping the page numbers from the pagination at the bottom of the page; that way you can stop as soon as you run out of pages. I also corrected BeautifulSoup.findAll to BeautifulSoup.find_all, because that method name is deprecated in BeautifulSoup 4 (which I assume you're using).

# please show your imports
from urllib.request import urlopen
from bs4 import BeautifulSoup
# I assume you imported BeautifulSoup as soup and urlopen as uReq


# I assume you opened a file object
with open('output.txt', 'w') as f:
    # note the hand-encoded "ASTON%20MARTIN": if you want this to be scalable,
    # escape URL-invalid characters with urllib.parse.quote() instead
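    # for example, urllib.parse.quote("ASTON MARTIN") returns "ASTON%20MARTIN",
    # so you could keep plain names in the list and quote them when building the URL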
    makes = ["ABARTH", "AC", "AIXAM", "ARIEL", "ASTON%20MARTIN", "AUDI"]
    # make it clear what variables are
    for page in range(1, 101):  # while testing, I used 9 pages for speed's sake
        for make in makes[:]:  # iterate over a copy so removing a make below is safe
            # don't overcomplicate variable names; an f-string works well here
            req_url = f"https://www.autotrader.co.uk/car-search?sort=distance&" \
                      f"postcode=BS247EY&radius=300&onesearchad=Used&onesearchad=Nearly%20New&" \
                      f"onesearchad=New&make={make}&page={page}"
            req = urlopen(req_url)
            page_html = req.read()
            req.close()
            page_soup = BeautifulSoup(page_html, "html.parser")
            # BeautifulSoup.findAll is deprecated use find_all instead
            listings = page_soup.find_all("li", {"class": "search-page__result"})
            for listing in listings:
                information_container = listing.find("div", {"class": "information-container"})
                title_container = information_container.find("a", {
                    "class": "js-click-handler listing-fpa-link tracking-standard-link"})
                title = title_container.text
                price = listing.find("div", {"class": "vehicle-price"}).text
                print("make:", make)
                print("title:", title)
                print("price:", price)
                f.write(title.replace(",", "") + "," + price.replace(",", "") + "\n")
            # Solving your issue:
            # take the page numbers from the pagination at the bottom of the
            # page; the last real page number is the last-but-one element (-2),
            # because the final element is the "next" arrow.
            pagination = page_soup.find_all('li', {'class': 'pagination--li'})
            # convert it to int and compare it to the current page: if it's
            # less than or equal to the current page, remove the make from the
            # list (the length guard covers results with no pagination at all)
            if len(pagination) < 2 or int(pagination[-2].text) <= page:
                makes.remove(make)
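
For reference, here's a minimal sketch of the inverted loop order suggested above (makes outer, pages inner). It assumes the same AutoTrader markup and URL parameters as the code above; with this structure a simple break replaces the list mutation entirely:

from urllib.request import urlopen
from urllib.parse import quote
from bs4 import BeautifulSoup

makes = ["ABARTH", "AC", "AIXAM", "ARIEL", "ASTON MARTIN", "AUDI"]

with open('output.txt', 'w') as f:
    for make in makes:
        for page in range(1, 101):
            req_url = f"https://www.autotrader.co.uk/car-search?sort=distance&" \
                      f"postcode=BS247EY&radius=300&onesearchad=Used&onesearchad=Nearly%20New&" \
                      f"onesearchad=New&make={quote(make)}&page={page}"
            with urlopen(req_url) as req:
                page_soup = BeautifulSoup(req.read(), "html.parser")
            for listing in page_soup.find_all("li", {"class": "search-page__result"}):
                title = listing.find("a", {"class": "js-click-handler listing-fpa-link tracking-standard-link"}).text
                price = listing.find("div", {"class": "vehicle-price"}).text
                f.write(title.replace(",", "") + "," + price.replace(",", "") + "\n")
            # stop paging this make as soon as the last real page number
            # (the last-but-one pagination element) has been reached
            pagination = page_soup.find_all('li', {'class': 'pagination--li'})
            if len(pagination) < 2 or int(pagination[-2].text) <= page:
                break  # move on to the next make

Because break only exits the inner page loop, there's no need to mutate makes while iterating over it, which avoids the skipped-element pitfall the copy in the code above works around.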

Upvotes: 1
