Reputation: 21
Hope you're staying safe during this corona period.
I'm pretty new to Python.
I am trying to scrape a website that sells car tyres, extracting the 'Brand', 'Model' and 'Price' of every tyre on each of the URLs in 'urls.csv', and then exporting the results to another csv. Here is a pastebin of the URLs in my urls.csv, if that is any help.
I have searched on here for similar questions, including this one. However, even those answers paste the URLs directly into the code and run that. I want clean code that looks into my csv, fetches the first URL, scrapes it and puts the results into the output csv, then goes back to the csv for the second URL, scrapes it and puts the results on the second line of the output csv.
I have managed to make a scraper that can scrape a single URL when I manually put that URL into the code (which was a big deal for me, haha). My current code:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
#I want to be able to swap this request for every URL in the csv
page = requests.get('https://www.beaurepaires.com.au/tyresize/155W_70R_13')
soup = BeautifulSoup(page.content, 'html.parser')
tyres = soup.find(class_='comp-product-details products-grid')
items = tyres.find_all(class_='item')
brand = [str(item.find(class_='dealer-logo')).split(' ')[1].split('=')[1].split('"')[1] for item in items]
model = [item.find(class_='product-title').get_text() for item in items]
price = [item.find(class_='main-price').get_text().split('/')[0].split(' ')[1] for item in items]
tyre_stuff = pd.DataFrame({
    'brand': brand,
    'model': model,
    'price': price,
})
print(tyre_stuff)
tyre_stuff.to_csv('beaurepaires_01.csv', mode='a', header=False)
Can someone point me in the right direction? I think I have to import csv, and I have a feeling I will not be able to use the '.to_csv' command. I also feel I might have to use urllib or something similar.
EDIT1: Here are the URLs in my csv format
https://www.beaurepaires.com.au/tyresize/145W_65R_15
https://www.beaurepaires.com.au/tyresize/155W_65R_13
https://www.beaurepaires.com.au/tyresize/155W_65R_14
https://www.beaurepaires.com.au/tyresize/155W_70R_13
EDIT2: Here is my updated code as per Paul's help, but I am still stuck on how to get csv.DictWriter working to output my results.
def get_products(url):
    import requests
    from bs4 import BeautifulSoup
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    for item in soup.findAll("div", {"class": "product-item show-online-price"}):
        title = item["title"]
        brand = item.find("img", {"class": "dealer-logo"})["alt"]
        price = item.find("span", {"class": "price"}).getText()
        yield {
            "title": title,
            "brand": brand,
            "price": price
        }

def get_all_products(filename):
    import csv
    with open(filename, "r", newline="") as file:
        for tokens in csv.reader(file):
            url = tokens[0]
            for product in get_products(url):
                yield product

def main():
    from time import sleep
    from random import randint
    import csv
    sleep(randint(3, 11))
    with open('output.csv', 'w', newline='') as csvfile:
        fieldnames = ['title', 'brand', 'price']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        for product in get_all_products("urls.csv"):
            writer.writeheader()
            writer.writerow({'title', 'brand', 'price'})
            print(product)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Upvotes: 2
Views: 704
Reputation: 10819
Something like this, maybe? get_products is a generator that yields all the products from a given URL as dictionaries. It's called get_products rather than get_product because a given URL may have more than one product. get_all_products is another generator that yields every product yielded by get_products for every URL in a given file of URLs. The main function then writes each product to products.csv with csv.DictWriter and prints it as the requests are made.
As others have mentioned, there isn't really any reason for you to save your URLs in a csv file, since you don't have comma-separated values, or values separated by delimiters of any kind. You just have a bunch of URLs, so why not store them in a plain text file? It would even save a few lines of code.
def get_products(url):
    import requests
    from bs4 import BeautifulSoup
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    for item in soup.findAll("div", {"class": "product-item show-online-price"}):
        title = item["title"]
        price = item.find("span", {"class": "price"}).getText()
        logo = item.find("img", {"class": "dealer-logo"})["src"]
        yield {
            "title": title,
            "price": price,
            "logo": logo
        }

def get_all_products(filename):
    with open(filename, "r", newline="") as file:
        for url in file.readlines():
            for product in get_products(url.strip()):
                yield product

def main():
    from csv import DictWriter
    with open("products.csv", "w", newline="") as file:
        field_names = ["title", "price", "logo"]
        writer = DictWriter(file, fieldnames=field_names)
        writer.writeheader()
        for product in get_all_products("urls.txt"):
            writer.writerow(product)
            file.flush()
            print(product)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Upvotes: 2
Reputation: 5625
First off, let's wrap the main scraping code up in a function. We can later pass just the url to this function (and maybe also an enumeration index) for convenience and readability.
def scrape(url, index):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    tyres = soup.find(class_='comp-product-details products-grid')
    items = tyres.find_all(class_='item')
    brand = [str(item.find(class_='dealer-logo')).split(' ')[1].split('=')[1].split('"')[1] for item in items]
    model = [item.find(class_='product-title').get_text() for item in items]
    price = [item.find(class_='main-price').get_text().split('/')[0].split(' ')[1] for item in items]
    tyre_stuff = pd.DataFrame({
        'brand': brand,
        'model': model,
        'price': price,
    })
    print(tyre_stuff)
    # one output file per URL, numbered by the enumeration index
    tyre_stuff.to_csv(f'beaurepaires_{index}.csv', header=False)
Now let's write the loop that passes in each url, after reading the csv, of course.
import csv
# other code
with open('links.csv', newline='') as links_file:
    links = csv.reader(links_file, delimiter=' ', quotechar='|')
    for i, link in enumerate(links):
        scrape(link[0], i)  # csv.reader yields each row as a list, so take the first field
Note that this example is taken directly from the docs. The parameters might need to change according to your specific csv format, but the main idea will be the same. After reading in the links, you simply enumerate over them (basically looping while also keeping track of the index) and pass both the link/url and the iteration number (index) to the scrape function.
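If your file really is just one URL per line, as in your EDIT1, you don't need a delimiter or quote character at all; a plain loop over the file is enough. A minimal sketch (still assuming the file is called links.csv, adjust the name to match yours):
# sketch: one URL per line, so just read the file directly
with open('links.csv') as links_file:
    for i, line in enumerate(links_file):
        url = line.strip()  # drop the trailing newline
        if url:             # skip any blank lines
            scrape(url, i)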
To address your concerns: I don't know why you "feel like" you need to use urllib, or why you "feel like" .to_csv won't work. If your DataFrame contains the data you want, calling .to_csv is perfectly valid. However, you may want to look at its parameters, demonstrated here.
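For instance, if you would rather append every URL's rows to one output file instead of writing one file per URL, mode and header are the parameters to tweak. A minimal sketch (the beaurepaires_all.csv filename is just an example):
import os

# sketch: append this scrape's DataFrame to a single running csv,
# writing the header row only the first time the file is created
output_file = 'beaurepaires_all.csv'
tyre_stuff.to_csv(
    output_file,
    mode='a',                                # append instead of overwrite
    header=not os.path.exists(output_file),  # header only on the first write
    index=False,                             # drop the DataFrame's index column
)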
Upvotes: 1