Reputation: 57
I currently have a problem with Scrapy. I am still new to it and would like to scrape information on Rolex watches. I started with the site Watch.de, where I first go through the Rolex listing page and then want to open each individual watch to get the exact details. However, when I start the crawler I see that many watches are scraped several times. I assume these are the watches from the "Recently viewed" and "Our new arrivals" sections. Is there a way to ignore these duplicates?
This is my code:
    import scrapy


    class WatchbotSpider(scrapy.Spider):
        name = 'watchbot'
        start_urls = ['https://www.watch.de/germany/rolex.html']

        def parse(self, response, **kwargs):
            # Follow every product link found on the Rolex listing page.
            for link in response.css('div.product-item-link a::attr(href)'):
                yield response.follow(link.get(), callback=self.parse_categories)

        def parse_categories(self, response):
            # Loops over every product tile present on the opened page.
            for product in response.css('div.product-item-link'):
                yield {
                    'id': product.css('span.product-item-id.product-item-ref::text').get(),
                    'brand': product.css('div.product-item-brand::text').get(),
                    'model': product.css('div.product-item-model::text').get(),
                    'price': product.css('span.price::text').get(),
                    'year': product.css('span.product-item-id.product-item-year::text').get(),
                }
Upvotes: 3
Views: 79
Reputation: 16187
You are already iterating over each item link in parse, so there is no need for a second for loop in the callback. Follow each link to its individual product page and select the desired fields directly from that page's response.
Code:
    import scrapy


    class WatchbotSpider(scrapy.Spider):
        name = 'watchbot'
        start_urls = ['https://www.watch.de/germany/rolex.html']

        def parse(self, response, **kwargs):
            # One request per product link on the listing page.
            for link in response.css('div.product-item-link a::attr(href)'):
                url = link.get()
                yield scrapy.Request(url, callback=self.parse_categories)

        def parse_categories(self, response):
            # One item per product page; no loop over product tiles here.
            # default='' keeps .strip()/.replace() from failing when a field is missing.
            yield {
                'id': response.css('div.product-ref-item.product-sku.col-auto>span[itemprop="sku"]::text').get(),
                'product_name': response.css('h1.product-name::text').get(default='').strip(),
                'price': response.xpath('(.//span[@class="price"])[1]/text()').get(default='').replace('\xa0€', ' '),
                'year': response.xpath('.//*[@class="product-item-date product-item-option"]/span/text()').get(),
            }
Output:
{'id': '10000060888', 'product_name': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Oyster - 31mm - Ungetragen', 'price': '8.118 ', 'year': '2021'}
2021-10-26 18:42:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-oyster-perpetual-date-stahl-weissgold-diamanten-automatik-armband-oyster-34mm-ref-115234-vintage-bj-2021-box-pap-full-set-ungetragen-verklebt.html>
{'id': '10000060571', 'product_name': 'Rolex Oyster Perpetual Date Diamanten - Stahl / Weißgold - Armband Edelstahl / Oyster - 34mm - Ungetragen - Vintage', 'price': '11.990 ', 'year': '2021'}
2021-10-26 18:42:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-sky-dweller-stahl-gelbgold-automatik-armband-stahl-gelbgold-jubile-42mm-ref-326933-bj-2021-box-pap-full-set-ungetragen-neuheit-2021.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-sky-dweller-stahl-gelbgold-automatik-armband-stahl-gelbgold-jubile-42mm-ref-326933-bj-2021-box-pap-full-set-ungetragen-neuheit-2021.html>
{'id': '10000060597', 'product_name': 'Rolex Sky-Dweller - Stahl / Gelbgold - Armband Stahl / Gelbgold / Jubilé - 42mm - Ungetragen', 'price': '24.924 ', 'year': '2021'}
2021-10-26 18:42:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-yacht-master-ii-stahl-rosegold-everose-automatik-chronograph-44mm-ref-116681-bj-2021-box-pap-full-set-ungetragen.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-datejust-stahl-automatik-armband-oyster-36mm-ref-126200-box-pap-lc-eu-full-set-wie-neu.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-yacht-master-ii-stahl-rosegold-everose-automatik-chronograph-44mm-ref-116681-bj-2021-box-pap-full-set-ungetragen.html>
{'id': '10000060591', 'product_name': 'Rolex Yacht-Master II - Stahl / Roségold - Armband Stahl / Roségold / Oyster - 44mm - Ungetragen', 'price': '27.979 ', 'year': '2021'}
2021-10-26 18:42:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-datejust-stahl-automatik-armband-oyster-36mm-ref-126200-box-pap-lc-eu-full-set-wie-neu.html>
{'id': '10000060580', 'product_name': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Oyster - 36mm - Wie neu', 'price': '8.080 ', 'year': '2015'}
2021-10-26 18:42:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-datejust-medium-stahl-automatik-armband-oyster-31mm-ref-278240-bj-2021-box-pap-full-set-ungetragen-60579.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-datejust-stahl-automatik-armband-oyster-41mm-ref-126300-box-pap-full-set-ungetragen-60647.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-datejust-medium-stahl-automatik-armband-oyster-31mm-ref-278240-bj-2021-box-pap-full-set-ungetragen-60579.html>
{'id': '10000060579', 'product_name': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Oyster - 31mm - Ungetragen', 'price': '8.118 ', 'year': '2021'}
2021-10-26 18:42:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-submariner-stahl-keramik-automatik-armband-oyster-40mm-ref-114060-bj-2015-box-pap-lc100-full-set-wie-neu.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-datejust-stahl-automatik-armband-oyster-41mm-ref-126300-box-pap-full-set-ungetragen-60647.html>
{'id': '10000060647', 'product_name': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Oyster - 41mm - Ungetragen', 'price': '9.999 ', 'year': '2021'}
2021-10-26 18:42:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-submariner-stahl-keramik-automatik-armband-oyster-40mm-ref-114060-bj-2015-box-pap-lc100-full-set-wie-neu.html>
{'id': '10000060625', 'product_name': 'Rolex Submariner - Edelstahl - Armband Edelstahl / Oyster - 40mm - Wie neu', 'price': '12.525 ', 'year': '2015'}
2021-10-26 18:42:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-oyster-perpetual-tiffany-stahl-automatik-armband-oyster-41mm-ref-124300-box-pap-full-set-ungetragen-60598.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-oyster-perpetual-tiffany-stahl-automatik-armband-oyster-41mm-ref-124300-box-pap-full-set-ungetragen-60598.html>
{'id': '10000060598', 'product_name': 'Rolex Oyster Perpetual Tiffany - Edelstahl - Armband Edelstahl / Oyster - 41mm - Ungetragen', 'price': '16.565 ', 'year': '2021'}
2021-10-26 18:42:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-datejust-41-stahl-weissgold-automatik-armband-jubile-41mm-ref-126334-box-pap-full-set-ungetragen-60614.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.watch.de/germany/rolex-sea-dweller-red-4000-stahl-keramik-automatik-armband-oyster-43mm-ref-126600-bj-2018-box-pap-lc-eu-full-set-wie-neu.html> (referer: https://www.watch.de/germany/rolex.html)
2021-10-26 18:42:56 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.watch.de/germany/rolex-datejust-41-stahl-weissgold-automatik-armband-jubile-41mm-ref-126334-box-pap-full-set-ungetragen-60614.html>
... and so on.
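If the listing page repeats a product link (for example in a side block), the same URL can appear more than once in the selector result. Scrapy's default duplicate filter already skips repeated requests for an identical URL within one run, but the links can also be deduplicated explicitly before any request is created. A minimal sketch in plain Python; unique_in_order is a hypothetical helper name, not part of Scrapy, and the sample paths are made up:

```python
from typing import Iterable, Iterator


def unique_in_order(values: Iterable[str]) -> Iterator[str]:
    """Yield each value once, keeping the order of first appearance."""
    seen = set()
    for value in values:
        if value in seen:
            # Already yielded earlier, e.g. a link repeated in another page block.
            continue
        seen.add(value)
        yield value


# Hypothetical hrefs, standing in for what the link selector might return.
hrefs = ['/a.html', '/b.html', '/a.html', '/c.html', '/b.html']
print(list(unique_in_order(hrefs)))  # ['/a.html', '/b.html', '/c.html']
```

Inside parse, this could wrap the selector result, for example by iterating over unique_in_order(response.css('div.product-item-link a::attr(href)').getall()).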
Upvotes: 1
Reputation: 762
This works:
    import scrapy


    class WatchbotSpider(scrapy.Spider):
        name = 'watchbot'
        start_urls = ['https://www.watch.de/germany/rolex.html']

        def parse(self, response, **kwargs):
            # Follow every product link found on the listing page.
            for link in response.css('div.product-item-link a::attr(href)'):
                yield response.follow(link.get(), callback=self.parse_categories)

        def parse_categories(self, response):
            # Build exactly one item from the product page itself.
            item = {
                'id': response.xpath('//div[@class="product-ref-item product-ref d-flex align-items-center"]/span/text()').get(),
                'brand': response.css('div.product-item-brand::text').get(),
                'model': response.xpath('//h1[@class="product-name"]/text()').get(),
                # default='' avoids an AttributeError when no price is found.
                'price': response.css('span.price::text').get(default='').replace('\xa0', ' '),
                'year': response.xpath('//div[@class="product-item-date product-item-option"]/span/text()').get(),
            }
            print(item)
            yield item
Run it with:
    scrapy crawl watchbot > log
The log then contains:
{'id': '278240', 'brand': 'Rolex ', 'model': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Oyster - 31mm - Ungetragen ', 'price': '8.118 €', 'year': '2021'}
{'id': '116201', 'brand': 'Rolex', 'model': 'Rolex Datejust - Stahl / Roségold - Armband Stahl / Roségold / Oyster - 36mm - Wie neu ', 'price': '14.545 €', 'year': '2018'}
{'id': '126622', 'brand': 'Rolex', 'model': 'Rolex Yacht-Master - Stahl / Platin - Armband Edelstahl / Oyster - 40mm - Ungetragen ', 'price': '15.995 €', 'year': '2020'}
{'id': '124300', 'brand': 'Rolex', 'model': 'Rolex Oyster Perpetual - Edelstahl - Armband Edelstahl / Oyster - 41mm - Ungetragen ', 'price': '9.898 €', 'year': '2021'}
{'id': '116500LN', 'brand': 'Rolex', 'model': 'Rolex Daytona - Edelstahl - Armband Edelstahl / Oyster - 40mm - Wie neu ', 'price': '33.999 €', 'year': '2020'}
{'id': '115234', 'brand': 'Rolex', 'model': 'Rolex Oyster Perpetual Date Diamanten - Stahl / Weißgold - Armband Edelstahl / Oyster - 34mm - Ungetragen - Vintage ', 'price': '11.990 €', 'year': '2021'}
{'id': '126200', 'brand': 'Rolex', 'model': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Jubilé - 36mm - Ungetragen ', 'price': '9.595 €', 'year': '2021'}
{'id': '126333 ', 'brand': 'Rolex', 'model': 'Rolex Datejust - Stahl / Gelbgold - Armband Stahl / Gelbgold / Jubilé - 41mm - Wie neu ', 'price': '15.959 €', 'year': '2021'}
{'id': '126334 ', 'brand': 'Rolex', 'model': 'Rolex Datejust Wimbledon - Stahl / Weißgold - Armband Edelstahl / Oyster - 41mm - Ungetragen ', 'price': '13.399 €', 'year': '2021'}
{'id': '278240', 'brand': 'Rolex', 'model': 'Rolex Datejust - Edelstahl - Armband Edelstahl / Oyster - 31mm - Ungetragen ', 'price': '8.118 €', 'year': '2021'}
...
Note that cleaning the values with replace(" ", "") raises exceptions for some items, so careful formatting is the next step.
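One way to approach that formatting step is to clean the values in small helpers that tolerate a missing field instead of calling string methods directly on the result of .get(). This is only a sketch under assumptions: clean_text and parse_price_eur are hypothetical names, and the price format is assumed to look like the '8.118 €' values in the log above (dot as thousands separator, optional decimal comma).

```python
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Collapse whitespace (including non-breaking spaces) and trim; None passes through."""
    if value is None:
        return None
    return " ".join(value.replace("\xa0", " ").split())


def parse_price_eur(value: Optional[str]) -> Optional[int]:
    """Turn a German-formatted price such as '8.118 €' into whole euros."""
    text = clean_text(value)
    if not text:
        return None
    digits = text.replace("€", "").replace(".", "").strip()
    # Drop any cents after a decimal comma, e.g. '8.118,00'.
    digits = digits.split(",")[0]
    return int(digits) if digits.isdigit() else None


print(clean_text("Rolex "))             # Rolex
print(parse_price_eur("8.118\xa0€"))    # 8118
print(parse_price_eur(None))            # None
```

In the spider, the item fields could then be wrapped, for example 'price': parse_price_eur(response.css('span.price::text').get()), so a page without a price yields None rather than an exception.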
Upvotes: 0