Gabriel Dias

Reputation: 79

Webscraping: Problem with dictionary inside list, json with duplicated data

I'm trying to webscrape Amazon's website to get data about their products. I'm getting the name, price, and currency of the product through Selenium Firefox and BeautifulSoup4.

However, my final list ends up with duplicated data: every entry is identical, and I have no idea why.

Here is my code:

import json
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

url = 'https://www.amazon.com.br/'

option = Options()
option.headless = True
driver = webdriver.Firefox(options=option)

driver.get(url)

driver.find_element_by_id('twotabsearchtextbox').send_keys('teclado mecânico')
driver.find_element_by_id('nav-search-submit-button').click()

products_html = driver.find_elements_by_xpath("//div[@class='a-section a-spacing-medium']")
products_list = [{'title': '', 'image': '', 'price': '', 'currency': ''}] * len(products_html)

for i in range(len(products_list)):
    html_content = products_html[i].get_attribute('innerHTML')
    soup = BeautifulSoup(html_content, 'lxml')
    
    title = soup.find('span', class_='a-size-base-plus a-color-base a-text-normal')
    image = soup.find('img', class_='s-image')
    price = soup.find('span', class_='a-price-whole')
    decimal = soup.find('span', class_='a-price-fraction')
    currency = soup.find('span', class_='a-price-symbol')

    products_list[i]['title'] = title.text if title else ''
    products_list[i]['image'] = image['src'] if image else ''
    products_list[i]['price'] = price.text + decimal.text if price else ''
    products_list[i]['currency'] = currency.text if currency else ''

driver.quit()

with open('data.json', 'w') as data:
    json.dump(products_list, data, indent=4)

A few lines of my json file:

[
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },
    {
        "title": "ANNE PRO 2, teclado mec\u00e2nico 60% com fio/sem fio (interruptor teron marrom/capa branca) \u2013 teclas completas program\u00e1veis \u2013 Verdadeiro RGB retroiluminado \u2013 Teclas de seta \u2013 Teclas PBT de disparo duplo \u2013 NKRO \u2013 Bateria de 1900 mAh",
        "image": "https://m.media-amazon.com/images/I/61ET53wJ9-L._AC_UL320_.jpg",
        "price": "732,00",
        "currency": "R$"
    },

As you can see, the JSON file is full of the same data.

Upvotes: 1

Views: 49

Answers (1)

Tim Roberts

Reputation: 54733

When you create products_list by multiplying a one-element list, you are not creating N different dictionaries. You are creating a list with N references to a single dictionary, so when you modify any of them, you're modifying all of them.
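A minimal sketch of that aliasing, using a shortened placeholder dictionary (the name rows and the single 'title' key are illustrative, not taken from the question):

```python
# List multiplication copies the reference, not the dictionary itself.
rows = [{'title': ''}] * 3

# Writing through one slot...
rows[0]['title'] = 'first'

# ...is visible through every slot, because all three are the same object.
print(rows)                 # [{'title': 'first'}, {'title': 'first'}, {'title': 'first'}]
print(rows[0] is rows[1])   # True
```

This is why the last product written in the loop appears in every position of the output.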

You should create products_list as empty:

products_list = []

and then append a new dictionary on each iteration of the loop:

products_list.append({
    'title': title.text if title else '',
    'image': image['src'] if image else '',
    'price': price.text + decimal.text if price else '',
    'currency': currency.text if currency else ''
})
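As a rough, self-contained illustration of the append pattern, the sketch below replaces the Selenium and BeautifulSoup steps with a hypothetical list of already-extracted values (scraped and its contents are made up for the example). It also shows a list comprehension, which is one way to keep a pre-sized list if that is really wanted, since the dictionary literal is evaluated once per slot:

```python
# Hypothetical stand-in for the values scraped from each product card.
scraped = [('Keyboard A', '732,00'), ('Keyboard B', '199,90')]

products_list = []
for title, price in scraped:
    # The dict literal is evaluated on every iteration,
    # so each appended entry is an independent object.
    products_list.append({'title': title, 'price': price})

# Alternative: a comprehension builds one distinct dict per slot,
# unlike [{...}] * n, which repeats a single dict.
prefilled = [{'title': '', 'price': ''} for _ in range(len(scraped))]
prefilled[0]['title'] = 'changed'   # only the first entry is affected
```

With either approach, the index-based assignments from the question would also work, because each position now holds its own dictionary.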

Upvotes: 1
