Reputation: 21
Hope you're staying safe during this corona period.
I'm pretty new to Python.
I am trying to scrape a website that sells car tyres, extracting the 'Brand', 'Model' and 'Price' of every tyre on each of the URLs in 'urls.csv', and then exporting the results to another csv. Here is a pastebin of the URLs in my urls.csv, if that is any help.
I have searched on here for similar questions, including this one. However, even those answers paste the URLs directly into the code and run that. I want clean code that looks into my csv, fetches the first URL, scrapes it and puts the results into the output csv, then goes back to the csv for the second URL, scrapes it and puts the results on the second line of the output csv.
I have managed to make a scraper that can scrape a single URL when I manually put that URL into the code (which was a big deal for me, haha). My current code:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
#I want to be able to swap this request for every URL in the csv
page = requests.get('https://www.beaurepaires.com.au/tyresize/155W_70R_13')
soup = BeautifulSoup(page.content, 'html.parser')
tyres = soup.find(class_='comp-product-details products-grid')
items = tyres.find_all(class_='item')
brand = [str(item.find(class_='dealer-logo')).split(' ')[1].split('=')[1].split('"')[1] for item in items]
model = [item.find(class_='product-title').get_text() for item in items]
price = [item.find(class_='main-price').get_text().split('/')[0].split(' ')[1] for item in items]
tyre_stuff = pd.DataFrame({
    'brand': brand,
    'model': model,
    'price': price,
})
print(tyre_stuff)
tyre_stuff.to_csv('beaurepaires_01.csv', mode='a', header=False)
Can someone point me in the right direction? I think I have to import csv, and I have a feeling I will not be able to use the '.to_csv' command. I also feel I might have to use urllib or something similar.
EDIT1: Here are the URLs in my csv format
https://www.beaurepaires.com.au/tyresize/145W_65R_15
https://www.beaurepaires.com.au/tyresize/155W_65R_13
https://www.beaurepaires.com.au/tyresize/155W_65R_14
https://www.beaurepaires.com.au/tyresize/155W_70R_13
EDIT2: Here is my updated code as per Paul's help, but I am still stuck on how to get csv.DictWriter working to output my results.
def get_products(url):
    import requests
    from bs4 import BeautifulSoup
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    for item in soup.findAll("div", {"class": "product-item show-online-price"}):
        title = item["title"]
        brand = item.find("img", {"class": "dealer-logo"})["alt"]
        price = item.find("span", {"class": "price"}).getText()
        yield {
            "title": title,
            "brand": brand,
            "price": price
        }

def get_all_products(filename):
    import csv
    with open(filename, "r", newline="") as file:
        for tokens in csv.reader(file):
            url = tokens[0]
            for product in get_products(url):
                yield product

def main():
    from time import sleep
    from random import randint
    import csv
    sleep(randint(3, 11))
    with open('output.csv', 'w', newline='') as csvfile:
        fieldnames = ['title', 'brand', 'price']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        for product in get_all_products("urls.csv"):
            writer.writeheader()
            writer.writerow({'title', 'brand', 'price'})
            print(product)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Upvotes: 2
Views: 704
Reputation: 10819
Something like this, maybe? get_products is a generator that yields all the products from a given URL as dictionaries. It's called get_products rather than get_product because a given URL may have more than one product. get_all_products is another generator that yields every product yielded by get_products for every URL in a given file of URLs. The main function then writes each product to products.csv with csv.DictWriter and prints it as the requests are made.
As others have mentioned, there isn't really any reason for you to save your URLs in a csv file, since you don't have comma-separated values, or values separated by delimiters of any kind. You just have a bunch of URLs, so why not store them in a plain text file? It would even save a few lines of code.
def get_products(url):
    import requests
    from bs4 import BeautifulSoup
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    for item in soup.findAll("div", {"class": "product-item show-online-price"}):
        title = item["title"]
        price = item.find("span", {"class": "price"}).getText()
        logo = item.find("img", {"class": "dealer-logo"})["src"]
        yield {
            "title": title,
            "price": price,
            "logo": logo
        }

def get_all_products(filename):
    with open(filename, "r", newline="") as file:
        for url in file.readlines():
            for product in get_products(url.strip()):
                yield product

def main():
    from csv import DictWriter
    with open("products.csv", "w", newline="") as file:
        field_names = ["title", "price", "logo"]
        writer = DictWriter(file, fieldnames=field_names)
        writer.writeheader()
        for product in get_all_products("urls.txt"):
            writer.writerow(product)
            file.flush()
            print(product)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Upvotes: 2
Reputation: 5625
First off, let's wrap the main scraping code up in a function. We can later pass just the url to this function (and maybe also an enumeration index) for convenience and readability.
def scrape(url, index):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    tyres = soup.find(class_='comp-product-details products-grid')
    items = tyres.find_all(class_='item')
    brand = [str(item.find(class_='dealer-logo')).split(' ')[1].split('=')[1].split('"')[1] for item in items]
    model = [item.find(class_='product-title').get_text() for item in items]
    price = [item.find(class_='main-price').get_text().split('/')[0].split(' ')[1] for item in items]
    tyre_stuff = pd.DataFrame({
        'brand': brand,
        'model': model,
        'price': price,
    })
    print(tyre_stuff)
    # one output file per URL, numbered by the enumeration index
    tyre_stuff.to_csv(f'beaurepaires_{index}.csv', header=False)
Now let's write the loop that passes in each url, after reading the csv, of course.
import csv
# other code
with open('links.csv', newline='') as links_file:
    links = csv.reader(links_file, delimiter=' ', quotechar='|')
    for i, link in enumerate(links):
        scrape(link[0], i)  # csv.reader yields each row as a list, so take the first field
Note that this example is taken directly from the docs. The parameters might need to change according to your specific csv format, but the main idea will be the same. After reading in the links, you simply enumerate over them (basically looping while also keeping track of the index) and pass both the link/url and the iteration number (index) to the scrape function.
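If your file really is just one URL per line, as in your EDIT1, you don't need a delimiter or quote character at all; a plain loop over the file is enough. A minimal sketch (still assuming the file is called links.csv, adjust the name to match yours):
# sketch: one URL per line, so just read the file directly
with open('links.csv') as links_file:
    for i, line in enumerate(links_file):
        url = line.strip()  # drop the trailing newline
        if url:             # skip any blank lines
            scrape(url, i)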
To address your concerns: I don't know why you "feel like" you need to use urllib, or why you "feel like" .to_csv won't work. If your DataFrame contains the data you want, calling .to_csv is perfectly valid. However, you may want to look at its parameters, demonstrated here.
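For instance, if you would rather append every URL's rows to one output file instead of writing one file per URL, mode and header are the parameters to tweak. A minimal sketch (the beaurepaires_all.csv filename is just an example):
import os

# sketch: append this scrape's DataFrame to a single running csv,
# writing the header row only the first time the file is created
output_file = 'beaurepaires_all.csv'
tyre_stuff.to_csv(
    output_file,
    mode='a',                                # append instead of overwrite
    header=not os.path.exists(output_file),  # header only on the first write
    index=False,                             # drop the DataFrame's index column
)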
Upvotes: 1