SyrixGG

Reputation: 57

Scrapy crawls duplicate data

Unfortunately, I currently have a problem with Scrapy. I am still new to Scrapy and would like to scrape information on Rolex watches. I started with the site Watch.de, where I first go through the Rolex listing page and then open the individual watches to get the exact information. However, when I start the crawler, I see that many watches are crawled several times. I assume these are the watches from the "Recently viewed" and "Our new arrivals" sections. Is there a way to ignore these duplicates?

This is my code:

import scrapy


class WatchbotSpider(scrapy.Spider):
    name = 'watchbot'
    start_urls = ['https://www.watch.de/germany/rolex.html']

    def parse(self, response, **kwargs):
        for link in response.css('div.product-item-link a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_categories)

    def parse_categories(self, response):
        for product in response.css('div.product-item-link'):
            yield {
                'id': product.css('span.product-item-id.product-item-ref::text').get(),
                'brand': product.css('div.product-item-brand::text').get(),
                'model': product.css('div.product-item-model::text').get(),
                'price': product.css('span.price::text').get(),
                'year': product.css('span.product-item-id.product-item-year::text').get(),
            }

Upvotes: 3

Views: 79

Answers (2)

Md. Fazlul Hoque

Reputation: 16187

You already iterate over every product link in parse, so there is no need for a second for loop in parse_categories. Each request takes you to an individual product page with its own URL, and on that page you select the desired data fields directly from the response.

Code:

import scrapy


class WatchbotSpider(scrapy.Spider):
    name = 'watchbot'
    start_urls = ['https://www.watch.de/germany/rolex.html']

    def parse(self, response, **kwargs):
        for link in response.css('div.product-item-link a::attr(href)'):
            url = link.get()
            yield scrapy.Request(url, callback=self.parse_categories)

    def parse_categories(self, response):
        yield {
            'id': response.css('div.product-ref-item.product-sku.col-auto>span[itemprop="sku"]::text').get(),
            'product_name': response.css('h1.product-name::text').get().strip(),
            'price': response.xpath('(.//span[@class="price"])[1]/text()').get().replace('\xa0€', ' '),
            'year': response.xpath('.//*[@class="product-item-date product-item-option"]/span/text()').get(),
        }

Output:

{'id': '10000060888', 'product_name': 'Rolex Datejust   - Edelstahl - Armband  Edelstahl / Oyster - 31mm - Ungetragen', 'price': '8.118 ', 'year': '2021'}
2021-10-26 18:42:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-oyster-perpetual-date-stahl-weissgold-diamanten-automatik-armband-oyster-34mm-ref-115234-vintage-bj-2021-box-pap-full-set-ungetragen-verklebt.html>
{'id': '10000060571', 'product_name': 'Rolex Oyster Perpetual Date Diamanten   - Stahl / Weißgold - Armband  Edelstahl / Oyster - 34mm - Ungetragen - Vintage', 'price': '11.990 ', 'year': '2021'}
2021-10-26 18:42:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-sky-dweller-stahl-gelbgold-automatik-armband-stahl-gelbgold-jubile-42mm-ref-326933-bj-2021-box-pap-full-set-ungetragen-neuheit-2021.html> (referer: https://www.watch.de/germany/rolex.html)    
2021-10-26 18:42:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-sky-dweller-stahl-gelbgold-automatik-armband-stahl-gelbgold-jubile-42mm-ref-326933-bj-2021-box-pap-full-set-ungetragen-neuheit-2021.html>
{'id': '10000060597', 'product_name': 'Rolex Sky-Dweller  - Stahl / Gelbgold - Armband  Stahl / Gelbgold / Jubilé - 42mm - Ungetragen', 'price': '24.924 ', 'year': '2021'}
2021-10-26 18:42:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-yacht-master-ii-stahl-rosegold-everose-automatik-chronograph-44mm-ref-116681-bj-2021-box-pap-full-set-ungetragen.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-datejust-stahl-automatik-armband-oyster-36mm-ref-126200-box-pap-lc-eu-full-set-wie-neu.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-yacht-master-ii-stahl-rosegold-everose-automatik-chronograph-44mm-ref-116681-bj-2021-box-pap-full-set-ungetragen.html>
{'id': '10000060591', 'product_name': 'Rolex Yacht-Master II  - Stahl / Roségold - Armband  Stahl / Roségold / Oyster - 44mm - Ungetragen', 'price': '27.979 ', 'year': '2021'}
2021-10-26 18:42:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-datejust-stahl-automatik-armband-oyster-36mm-ref-126200-box-pap-lc-eu-full-set-wie-neu.html>  
{'id': '10000060580', 'product_name': 'Rolex Datejust  - Edelstahl - Armband  Edelstahl / Oyster - 36mm - Wie neu', 'price': '8.080 ', 'year': '2015'}
2021-10-26 18:42:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-datejust-medium-stahl-automatik-armband-oyster-31mm-ref-278240-bj-2021-box-pap-full-set-ungetragen-60579.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-datejust-stahl-automatik-armband-oyster-41mm-ref-126300-box-pap-full-set-ungetragen-60647.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-datejust-medium-stahl-automatik-armband-oyster-31mm-ref-278240-bj-2021-box-pap-full-set-ungetragen-60579.html>
{'id': '10000060579', 'product_name': 'Rolex Datejust   - Edelstahl - Armband  Edelstahl / Oyster - 31mm - Ungetragen', 'price': '8.118 ', 'year': '2021'}
2021-10-26 18:42:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-submariner-stahl-keramik-automatik-armband-oyster-40mm-ref-114060-bj-2015-box-pap-lc100-full-set-wie-neu.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-datejust-stahl-automatik-armband-oyster-41mm-ref-126300-box-pap-full-set-ungetragen-60647.html>
{'id': '10000060647', 'product_name': 'Rolex Datejust  - Edelstahl - Armband  Edelstahl / Oyster - 41mm - Ungetragen', 'price': '9.999 ', 'year': '2021'}
2021-10-26 18:42:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-submariner-stahl-keramik-automatik-armband-oyster-40mm-ref-114060-bj-2015-box-pap-lc100-full-set-wie-neu.html>
{'id': '10000060625', 'product_name': 'Rolex Submariner  - Edelstahl - Armband  Edelstahl / Oyster - 40mm - Wie neu', 'price': '12.525 ', 'year': '2015'}
2021-10-26 18:42:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-oyster-perpetual-tiffany-stahl-automatik-armband-oyster-41mm-ref-124300-box-pap-full-set-ungetragen-60598.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-oyster-perpetual-tiffany-stahl-automatik-armband-oyster-41mm-ref-124300-box-pap-full-set-ungetragen-60598.html>
{'id': '10000060598', 'product_name': 'Rolex Oyster Perpetual  Tiffany - Edelstahl - Armband  Edelstahl / Oyster - 41mm - Ungetragen', 'price': '16.565 ', 'year': '2021'}
2021-10-26 18:42:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-datejust-41-stahl-weissgold-automatik-armband-jubile-41mm-ref-126334-box-pap-full-set-ungetragen-60614.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-sea-dweller-red-4000-stahl-keramik-automatik-armband-oyster-43mm-ref-126600-bj-2018-box-pap-lc-eu-full-set-wie-neu.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-datejust-41-stahl-weissgold-automatik-armband-jubile-41mm-ref-126334-box-pap-full-set-ungetragen-60614.html>

... and so on.
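Scrapy already filters duplicate requests by default, so the same product URL is only fetched once unless a request sets dont_filter=True. If the same watch still ends up in the results more than once (for example because it is reachable under different URLs), the items themselves can be filtered on a unique field. The following is only a minimal sketch of that idea in plain Python: the function name drop_duplicates and the sample data are hypothetical, and it assumes 'id' is unique per listing. In a Scrapy project the same check usually lives in an item pipeline that raises scrapy.exceptions.DropItem for repeated ids.

```python
from typing import Dict, Iterable, Iterator, Optional

Item = Dict[str, Optional[str]]


def drop_duplicates(items: Iterable[Item], key: str = "id") -> Iterator[Item]:
    """Yield each item once, judged by the whitespace-stripped value under `key`.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    seen = set()
    for item in items:
        value = item.get(key)
        if value is None:
            # Without a key there is nothing to compare, so keep the item.
            yield item
            continue
        marker = value.strip()  # '126334 ' and '126334' count as the same key
        if marker in seen:
            continue  # an item with this key was already yielded
        seen.add(marker)
        yield item


if __name__ == "__main__":
    # Hypothetical sample: the third entry repeats the first one.
    scraped = [
        {"id": "10000060888", "price": "8.118 "},
        {"id": "10000060571", "price": "11.990 "},
        {"id": "10000060888", "price": "8.118 "},
    ]
    for item in drop_duplicates(scraped):
        print(item)
```

Pick the key with care: a shop SKU is normally unique per listing, while a manufacturer reference number can be shared by several distinct watches, so filtering on the reference would drop real listings.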

Upvotes: 1

Damir Devetak

Reputation: 762

This works:

import scrapy


class WatchbotSpider(scrapy.Spider):
    name = 'watchbot'
    start_urls = ['https://www.watch.de/germany/rolex.html']

    def parse(self, response, **kwargs):
        for link in response.css('div.product-item-link a::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_categories)

    def parse_categories(self, response):
        item = {
            'id': response.xpath('//div[@class="product-ref-item product-ref d-flex align-items-center"]/span/text()').get(),
            'brand': response.css('div.product-item-brand::text').get(),
            'model': response.xpath('//h1[@class="product-name"]/text()').get(),
            'price': response.css('span.price::text').get().replace('\xa0', ' '),
            'year': response.xpath('//div[@class="product-item-date product-item-option"]/span/text()').get(),
        }

        print(item)
        yield item
 
scrapy crawl watchbot > log

In the log:

{'id': '278240', 'brand': 'Rolex ', 'model': 'Rolex Datejust   - Edelstahl - Armband  Edelstahl / Oyster - 31mm - Ungetragen            ', 'price': '8.118 €', 'year': '2021'}
{'id': '116201', 'brand': 'Rolex', 'model': 'Rolex Datejust   - Stahl / Roségold - Armband  Stahl / Roségold / Oyster - 36mm - Wie neu            ', 'price': '14.545 €', 'year': '2018'}
{'id': '126622', 'brand': 'Rolex', 'model': 'Rolex Yacht-Master  - Stahl / Platin - Armband  Edelstahl / Oyster - 40mm - Ungetragen            ', 'price': '15.995 €', 'year': '2020'}
{'id': '124300', 'brand': 'Rolex', 'model': 'Rolex Oyster Perpetual   - Edelstahl - Armband  Edelstahl / Oyster - 41mm - Ungetragen            ', 'price': '9.898 €', 'year': '2021'}
{'id': '116500LN', 'brand': 'Rolex', 'model': 'Rolex Daytona  - Edelstahl - Armband  Edelstahl / Oyster - 40mm - Wie neu            ', 'price': '33.999 €', 'year': '2020'}
{'id': '115234', 'brand': 'Rolex', 'model': 'Rolex Oyster Perpetual Date Diamanten   - Stahl / Weißgold - Armband  Edelstahl / Oyster - 34mm - Ungetragen - Vintage             ', 'price': '11.990 €', 'year': '2021'}
{'id': '126200', 'brand': 'Rolex', 'model': 'Rolex Datejust  - Edelstahl - Armband  Edelstahl / Jubilé - 36mm - Ungetragen            ', 'price': '9.595 €', 'year': '2021'}
{'id': '126333 ', 'brand': 'Rolex', 'model': 'Rolex Datejust   - Stahl / Gelbgold - Armband  Stahl / Gelbgold / Jubilé - 41mm - Wie neu            ', 'price': '15.959 €', 'year': '2021'}
{'id': '126334 ', 'brand': 'Rolex', 'model': 'Rolex Datejust Wimbledon  - Stahl / Weißgold - Armband  Edelstahl / Oyster - 41mm - Ungetragen            ', 'price': '13.399 €', 'year': '2021'}
{'id': '278240', 'brand': 'Rolex', 'model': 'Rolex Datejust   - Edelstahl - Armband  Edelstahl / Oyster - 31mm - Ungetragen            ', 'price': '8.118 €', 'year': '2021'}
.
.
.

Formatting with replace(" ", "") will cause some exceptions, so careful formatting is the next step.
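One likely source of such exceptions is that .get() returns None when a selector matches nothing, so calling .replace() on the result fails. As a rough sketch of more defensive formatting, assuming prices look like the '8.118 €' strings in the log above (dot as thousands separator, optional decimal comma), something like the following could be used. The helper names clean_text and parse_price_eur are made up for this example.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def clean_text(raw: Optional[str]) -> Optional[str]:
    """Collapse runs of whitespace and trim the ends; pass None through."""
    if raw is None:
        return None
    # str.split() without arguments also splits on non-breaking spaces.
    return " ".join(raw.split())


def parse_price_eur(raw: Optional[str]) -> Optional[Decimal]:
    """Turn a German-formatted euro price such as '8.118 €' into a Decimal.

    Returns None when the text is missing or is not a number.
    """
    text = clean_text(raw)
    if not text:
        return None
    number = text.replace("€", "").replace(" ", "")
    # Drop thousands separators, then convert a decimal comma to a dot.
    number = number.replace(".", "").replace(",", ".")
    try:
        return Decimal(number)
    except InvalidOperation:
        return None


if __name__ == "__main__":
    print(parse_price_eur("8.118\xa0€"))
    print(parse_price_eur(None))
    print(clean_text("Rolex Datejust   - Edelstahl            "))
```

Returning None instead of raising keeps the spider running when a page lacks a price; whether that or a logged error is preferable depends on how the data is used afterwards.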

Upvotes: 0
