Reputation: 79
I'm trying to scrape Amazon's website to get data about their products. I'm extracting the title, image, price, and currency of each product using Selenium (Firefox) and BeautifulSoup4.
However, my final list ends up with duplicated data: every entry is identical, and I have no idea why.
Here is my code:
import json
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
url = 'https://www.amazon.com.br/'
option = Options()
option.headless = True
driver = webdriver.Firefox(options=option)
driver.get(url)
driver.find_element_by_id('twotabsearchtextbox').send_keys('teclado mecânico')
driver.find_element_by_id('nav-search-submit-button').click()
products_html = driver.find_elements_by_xpath("//div[@class='a-section a-spacing-medium']")
products_list = [{'title': '', 'image': '', 'price': '', 'currency': ''}] * len(products_html)
for i in range(len(products_list)):
    html_content = products_html[i].get_attribute('innerHTML')
    soup = BeautifulSoup(html_content, 'lxml')
    title = soup.find('span', class_='a-size-base-plus a-color-base a-text-normal')
    image = soup.find('img', class_='s-image')
    price = soup.find('span', class_='a-price-whole')
    decimal = soup.find('span', class_='a-price-fraction')
    currency = soup.find('span', class_='a-price-symbol')
    products_list[i]['title'] = title.text if title else ''
    products_list[i]['image'] = image['src'] if image else ''
    products_list[i]['price'] = price.text + decimal.text if price else ''
    products_list[i]['currency'] = currency.text if currency else ''
driver.quit()
with open('data.json', 'w') as data:
    json.dump(products_list, data, indent=4)
A few lines of my json file:
[
{
"title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
"image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
"price": "732,00",
"currency": "R$"
},
{
"title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
"image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
"price": "732,00",
"currency": "R$"
},
{
"title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
"image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
"price": "732,00",
"currency": "R$"
},
{
"title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
"image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
"price": "732,00",
"currency": "R$"
},
{
"title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
"image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
"price": "732,00",
"currency": "R$"
},
As you can see, the JSON file is filled with the same data.
Upvotes: 1
Views: 49
Reputation: 54733
When you create products_list like that, you are not creating N different dictionaries. You are creating a list with N references to a single dictionary, so when you modify any one of them, you're modifying all of them.
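The aliasing is easy to confirm in isolation. The following is only a minimal sketch with made-up keys and values (the names `template` and `aliased` are illustrative, not from the question's code); it shows that every slot in the multiplied list is the same object:

```python
# Multiplying a list copies references to the dict, not the dict itself.
template = {'title': '', 'price': ''}
aliased = [template] * 3

# Writing through one slot is visible through every slot.
aliased[0]['title'] = 'first'

# Every element is literally the same object as the first one.
same_object = all(item is aliased[0] for item in aliased)
titles = [item['title'] for item in aliased]

print(same_object)  # True
print(titles)       # ['first', 'first', 'first']
```

Since the loop in the question overwrites that one shared dictionary on every iteration, the values that survive are those written on the last iteration.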
You should create products_list as empty:
products_list = []
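and then append a new dictionary on each iteration. Because the list now starts empty, loop over products_html (for example, for i in range(len(products_html))) rather than over products_list.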
products_list.append({
    'title': title.text if title else '',
    'image': image['src'] if image else '',
    'price': price.text + decimal.text if price else '',
    'currency': currency.text if currency else ''
})
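As a rough, self-contained sketch of the fix, the example below uses a hypothetical `scraped` list of tuples in place of the Selenium/BeautifulSoup results, so it runs without a browser. It also shows an alternative that is my own suggestion rather than part of the advice above: if you prefer to keep index-based assignment, a list comprehension evaluates the dict literal once per iteration and therefore produces independent dictionaries.

```python
# Hypothetical values standing in for the parsed title/price of each product.
scraped = [('Keyboard A', '100,00'), ('Keyboard B', '250,00')]

# Option 1: start with an empty list and append a fresh dict per product.
products_list = []
for title, price in scraped:
    products_list.append({'title': title, 'price': price})

# Option 2: preallocate with a comprehension. The dict literal is evaluated
# on every iteration, so each slot holds its own dict.
preallocated = [{'title': '', 'price': ''} for _ in range(len(scraped))]
for i, (title, price) in enumerate(scraped):
    preallocated[i]['title'] = title
    preallocated[i]['price'] = price

print(products_list == preallocated)       # True
print(preallocated[0] is preallocated[1])  # False
```

Either approach avoids the shared-reference problem; appending is usually simpler because the list never needs to know its final length in advance.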
Upvotes: 1