Reputation: 57
I am trying to crawl Rolex watches on the chrono24 site with a spider named chrono24, but Scrapy crawls 0 pages. In the Scrapy shell, response.css('div.article-item-container.wt-search-result') returns only an empty list, and I don't know why. The output from Scrapy is the following:
2021-12-29 12:44:07 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: watch)
2021-12-29 12:44:07 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.10.0 (v3.10.0:b494f5935c, Oct 4 2021, 14:59:20) [Clang 12.0.5 (clang-1205.0.22.11)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform macOS-12.0.1-x86_64-i386-64bit
2021-12-29 12:44:07 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-12-29 12:44:07 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'watch',
'NEWSPIDER_MODULE': 'watches.spiders',
'SPIDER_MODULES': ['watches.spiders']}
2021-12-29 12:44:07 [scrapy.extensions.telnet] INFO: Telnet Password: 8d7e7f060b3fa52b
2021-12-29 12:44:07 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2021-12-29 12:44:07 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-29 12:44:07 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-29 12:44:10 [mysql.connector.connection] DEBUG: # _do_auth(): user: tinae
2021-12-29 12:44:10 [mysql.connector.connection] DEBUG: # _do_auth(): self._auth_plugin:
2021-12-29 12:44:10 [mysql.connector.connection] DEBUG: new_auth_plugin: caching_sha2_password
2021-12-29 12:44:11 [scrapy.middleware] INFO: Enabled item pipelines:
['watches.pipelines.WatchesPipeline']
2021-12-29 12:44:11 [scrapy.core.engine] INFO: Spider opened
2021-12-29 12:44:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-29 12:44:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-29 12:44:11 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-29 12:44:11 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.003334,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 29, 11, 44, 11, 77335),
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'memusage/max': 63164416,
'memusage/startup': 63164416,
'start_time': datetime.datetime(2021, 12, 29, 11, 44, 11, 74001)}
2021-12-29 12:44:11 [scrapy.core.engine] INFO: Spider closed (finished)
And my code for the spider is the following:
import scrapy
import self as self
from scrapy.loader import ItemLoader
from ..items import WatchItem
from ..items import Transaction


class WatchbotSpider(scrapy.Spider):
    name = 'watchde'
    # crawl everything
    start_urls = ['https://www.watch.de/english/rolex.html?p=%s' % page for page in range(1, 50)]
    #start_urls = ['https://www.watch.de/english/rolex.html']

    def _parse(self, response, **kwargs):
        for link in response.css('div.product-item-link a::attr(href)'):
            url = link.get()
            yield scrapy.Request(url, callback=self.parse_categories)

    def parse_categories(self, response):
        for product in response.xpath('//*[@id="main"]'):
            l = ItemLoader(item=WatchItem(), selector=product)
            l.add_xpath('itemnr', '//*[@id="main"]/div[2]/div[1]/div[1]/div[3]/div[1]/div[1]/span')
            l.add_xpath('reference', '//*[@id="main"]/div[2]/div[1]/div[1]/div[3]/div[1]/div[2]/span')
            l.add_xpath('year', '//*[@id="main"]/div[2]/div[1]/div[1]/div[3]/div[2]/div[1]/span')
            l.add_xpath('sizemm', '//*[@id="main"]/div[2]/div[1]/div[1]/div[3]/div[2]/div[2]/span')
            l.add_xpath('brand', '//*[@id="collapse-1"]/div/div[1]/div[2]/div/div[1]/div/div[2]')
            l.add_xpath('model', '//div[@itemprop="model"]/text()')
            l.add_xpath('materialcase', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[2]/div/div[2]')
            l.add_xpath('crown', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[3]/div/div[2]')
            l.add_xpath('dial', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[4]/div/div[2]')
            l.add_xpath('clock_hand', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[5]/div/div[2]')
            l.add_xpath('glas', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[6]/div/div[2]/text()')
            l.add_xpath('bezel', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[8]/div/div[2]')
            l.add_xpath('weightgg', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[9]/div/div[2]')
            l.add_xpath('matbracelet', '//*[@id="collapse-1"]/div/div[3]/div[2]/div/div[1]/div/div[2]')
            l.add_xpath('clasp', '//*[@id="collapse-1"]/div/div[3]/div[2]/div/div[2]/div/div[2]')
            l.add_xpath('lengthbracelet', '//*[@id="collapse-1"]/div/div[3]/div[2]/div/div[3]/div/div[2]')
            l.add_xpath('caliber', '//*[@id="collapse-1"]/div/div[4]/div[2]/div/div[1]/div/div[2]')
            l.add_xpath('lift', '//*[@id="collapse-1"]/div/div[4]/div[2]/div/div[2]/div/div[2]')
            l.add_xpath('stonescount', '//*[@id="collapse-1"]/div/div[4]/div[2]/div/div[3]/div/div[2]')
            l.add_xpath('waterres', '//*[@id="collapse-1"]/div/div[5]/div[2]/div/div[1]/div/div[2]')
            l.add_xpath('c_condition', '//*[@id="collapse-1"]/div/div[5]/div[2]/div/div[2]/div/div[2]')
            l.add_xpath('casenr', '//*[@id="collapse-1"]/div/div[5]/div[2]/div/div[6]/div/div[2]')
            yield l.load_item()
class WatchbotSpider2(scrapy.Spider):
    name = 'watchde2'
    #start_urls = ['https://www.watch.de/english/rolex.html?p=%s' % page for page in range(1, 50)]
    start_urls = ['https://www.watch.de/english/rolex.html']

    def _parse(self, response, **kwargs):
        for link in response.css('div.product-item-link a::attr(href)'):
            url = link.get()
            yield scrapy.Request(url, callback=self.parse_categories)

    def parse_categories(self, response):
        for product in response.xpath('//*[@id="main"]'):
            l = ItemLoader(item=Transaction(), selector=product)
            l.add_xpath('item2', '//*[@id="main"]/div[2]/div[1]/div[1]/div[3]/div[1]/div[1]/span')
            l.add_xpath('price', '//*[@id="main"]/div[2]/div[1]/div[1]/div[3]/form/div[1]/div[1]/div[1]/div[2]/div/span/span')
            yield l.load_item()
class WatchbotSpider3(scrapy.Spider):
    name = 'chrono24'
    self.start_urls = ['https://www.chrono24.com/rolex/index.htm']

    def _parse(self, response, **kwargs):
        for link in response.css('div.article-item-container wt-search-result a::attr(href)'):
            url = link.get()
            yield scrapy.Request(url, callback=self.parse_categories)

    def parse_categories(self, response):
        for product in response.xpath('//*[@id="main-content"]'):
            l = ItemLoader(item=WatchItem(), selector=product)
            # l.add_xpath('itemnr', '/html/body/div[4]/main/div/section[2]/section[1]/div/div[1]/div[1]/table/tbody[1]/tr[5]/td[2]')
            l.add_xpath('reference', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[1]/tr[5]/td[2]')
            l.add_xpath('year', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[1]/tr[10]/td[2]')
            l.add_xpath('sizemm', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[3]/tr[3]/td[2]')
            l.add_xpath('brand', '///*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[1]/tr[3]/td[2]/a')
            l.add_xpath('model', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[1]/tr[4]/td[2]/a')
            l.add_xpath('materialcase', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[1]/tr[8]/td[2]')
            # l.add_xpath('crown', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[3]/div/div[2]')#
            l.add_xpath('dial', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[3]/tr[4]/td[2]')
            # l.add_xpath('clock_hand', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[5]/div/div[2]')#
            # l.add_xpath('glas', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[6]/div/div[2]/text()')#
            l.add_xpath('bezel', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[3]/tr[5]/td[2]')
            # l.add_xpath('weightgg', '//*[@id="collapse-1"]/div/div[2]/div[2]/div/div[9]/div/div[2]')#
            l.add_xpath('matbracelet', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[1]/tr[9]/td[2]')
            l.add_xpath('clasp', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[4]/tr[4]/td[2]')
            # l.add_xpath('lengthbracelet', '//*[@id="collapse-1"]/div/div[3]/div[2]/div/div[3]/div/div[2]')#
            l.add_xpath('caliber', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[2]/tr[4]/td[2]')
            l.add_xpath('lift', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[2]/tr[2]/td[2]')
            l.add_xpath('stonescount', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[2]/tr[6]/td[2]')
            l.add_xpath('waterres', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[3]/tr[4]/td[2]')
            l.add_xpath('c_condition', '//*[@id="jq-specifications"]/div/div[1]/div[1]/table/tbody[1]/tr[10]/td[2]')
            l.add_xpath('casenr', '//*[@id="collapse-1"]/div/div[5]/div[2]/div/div[6]/div/div[2]')
            yield l.load_item()
Here is my items.py:
import scrapy
from itemloaders.processors import TakeFirst, MapCompose
from w3lib.html import remove_tags


class WatchItem(scrapy.Item):
    itemnr = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    #vendor = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    reference = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    year = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    sizemm = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    brand = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    model = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    materialcase = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    crown = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    dial = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    clock_hand = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    glas = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    bezel = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    weightgg = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    matbracelet = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    clasp = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    lengthbracelet = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    caliber = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    lift = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    stonescount = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    waterres = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    c_condition = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    casenr = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))


class Transaction(scrapy.Item):
    item2 = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    price = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
    #url = scrapy.Field(input_processor=MapCompose(remove_tags, output_processor=TakeFirst()))
Upvotes: 0
Views: 131
Reputation: 4822
You're getting relative URLs. Either prepend the base URL or use response.follow():
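class WatchbotSpider3(scrapy.Spider):
    name = 'chrono24'
    # Plain class attribute (the question wrote `self.start_urls = ...`)
    start_urls = ['https://www.chrono24.com/rolex/index.htm']

    # `_parse` is kept from the question; `parse` is the documented default callback name
    def _parse(self, response, **kwargs):
        # The two class names are joined with a dot, not separated by a space
        for link in response.css('div.article-item-container.wt-search-result a::attr(href)'):
            url = link.get()
            # response.follow() resolves the relative href against the response URL
            yield response.follow(url, callback=self.parse_categories)

    def parse_categories(self, response):
        print(response.url)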
Output:
https://www.chrono24.com/about-us.htm
https://www.chrono24.com/rolex/rolex-gmt-master-ii-2001-boxes-papers-pepsi-blue-red-bezel-black-dial-steel--id21614755.htm
https://www.chrono24.com/rolex/rolex-sky-dweller-black-dial-oyster-boxpaperscard-326934-stainless-steel--id21835712.htm
https://www.chrono24.com/rolex/rolex-rolex-sub-no-date-124060-stainless-steel-41mm-ceramic-100-complete-sept-2020--id21975593.htm
https://www.chrono24.com/rolex/rolex-daytona-two-tone-white-dial-18k-gold--116503--id21545861.htm
...
...
...
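The "prepend the base URL" alternative can be sketched with the standard library. This is only an illustration, not part of the spider: the relative href below is a hypothetical stand-in for a value returned by the selector, and urljoin is used to show the kind of resolution that response.follow() does for you.

```python
from urllib.parse import urljoin

# URL of the listing page the spider starts from (taken from the question)
page_url = 'https://www.chrono24.com/rolex/index.htm'

# Hypothetical relative href, standing in for a value extracted by the CSS selector
relative_href = '/rolex/example-watch--id0000000.htm'

# Joining resolves the relative path against the page URL's scheme and host
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
```

Inside a callback, response.urljoin(href) gives the same kind of result, and the absolute URL can then be passed to scrapy.Request.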
Also, you might want to fix your XPath expressions.
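One more difference from the question's spider may explain why the log shows no requests at all: the question assigns self.start_urls inside the class body, while the spider above uses a plain start_urls. Assuming the question's `import self as self` succeeds and binds `self` to a module object, that assignment sets an attribute on the module and never creates a class attribute. The sketch below reproduces the effect without Scrapy; BaseSpider and the SimpleNamespace stand-in are hypothetical names used only for illustration.

```python
from types import SimpleNamespace

# Stand-in for whatever object `import self as self` binds (an assumption)
self = SimpleNamespace()


class BaseSpider:
    # Simplified stand-in for a spider base class with no start URLs by default
    start_urls = []


class BrokenSpider(BaseSpider):
    name = 'chrono24'
    # Sets an attribute on the outer `self` object, not on the class
    self.start_urls = ['https://www.chrono24.com/rolex/index.htm']


class FixedSpider(BaseSpider):
    name = 'chrono24'
    # Plain class attribute, so the class itself carries the URLs
    start_urls = ['https://www.chrono24.com/rolex/index.htm']


print(BrokenSpider.start_urls)  # the class still sees the empty default
print(FixedSpider.start_urls)
```

With an empty list of start URLs there is nothing to request, which would be consistent with a spider that opens and finishes within a few milliseconds.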
Upvotes: 1
Reputation: 2609
It should work if you change
response.css('div.article-item-container wt-search-result a::attr(href)')
to
response.css('div.article-item-container.wt-search-result a::attr(href)')
so that the two class names are joined with a dot instead of being separated by a space:
In [28]: for link in response.css('div.article-item-container.wt-search-result a::attr(href)'):
    ...:     url = link.get()
    ...:     print(response.follow(link))
    ...:
<GET https://www.chrono24.com/rolex/rolex-datejust-41mm-smooth-blue-stick-dial-jubilee-126300-unworn-2021--id12873173.htm>
<GET https://www.chrono24.com/rolex/rolex-submariner-date-41mm-white-gold-smurf-126619lb-unworn-2021--id20061324.htm>
<GET https://www.chrono24.com/rolex/rolex-oyster-perpetual-41mm-silver-dial-124300-unworn-2020--id16516907.htm>
<GET https://www.chrono24.com/rolex/rolex-rolex-yacht-master-126655-18k-rose-gold-40mm-oysterflex-watch--id21952691.htm>
<GET https://www.chrono24.com/rolex/rolex-rolex-submariner-date-hulk-116610-lv-stainless--green-ceramic-40mm--id21761911.htm>
<GET https://www.chrono24.com/rolex/rolex-submariner-date-126610---unworn---2021---full-set--id21925760.htm>
<GET https://www.chrono24.com/rolex/rolex-submariner-date--id16957528.htm>
<GET https://www.chrono24.com/rolex/rolex-mint-daytona-serviced-by-rolex-beverly-hills-october-2021--id21160547.htm>
<GET https://www.chrono24.com/rolex/rolex-rolex-228235-rose-gold-day-date-olive-green-roman-dial-40mm--id21395066.htm>
<GET https://www.chrono24.com/rolex/rolex-daytona--id21773238.htm>
<GET https://www.chrono24.com/rolex/rolex-platinum-daytona-w-stickers-ice-blue-baguette-boxpaperscard-116506-platona--id21703249.htm>
<GET https://www.chrono24.com/rolex/rolex-gmt-master-ii-root-beer-stainless-steel-and-rose-gold--126711chnr--id21569611.htm>
<GET https://www.chrono24.com/rolex/rolex-daytona-white-gold-blue-dial-116509-2021yr--id21561030.htm>
<GET https://www.chrono24.com/rolex/rolex-rolex-submariner-date-starbucks-126610lv-41mm-new-steel-green-2021--id21716829.htm>
<GET https://www.chrono24.com/rolex/rolex-submariner-124060---full-set---unworn---2021--id21685488.htm>
<GET https://www.chrono24.com/rolex/rolex-gmt-master--id18592027.htm>
<GET https://www.chrono24.com/rolex/rolex-unworn-daytona--tahitian-pearl-dial--box-and-papers--18k-yellow-gold-116518--id20567835.htm>
<GET https://www.chrono24.com/rolex/rolex-rolex-326934--sky-dweller-stainless-steel-blue-stick--dial-oyster-bracelet--42mm--id8591612.htm>
<GET https://www.chrono24.com/rolex/rolex-daytona--id21488646.htm>
<GET https://www.chrono24.com/rolex/rolex-sky-dweller-black-dial-oyster-boxpaperscard-326934-stainless-steel--id21835712.htm>
<GET https://www.chrono24.com/rolex/rolex-datejust-41mm-wimbledon-dial-fluted-jubilee-steel-126334--id21569258.htm>
<GET https://www.chrono24.com/rolex/rolex-gmt-master-ii-black-discontinued-under-warranty--id21947571.htm>
<GET https://www.chrono24.com/rolex/rolex-rolex-yacht-master-ii-116688-18k-yellow-gold-44mm-oyster--id22016146.htm>
<GET https://www.chrono24.com/rolex/rolex-submariner-date-116613lb---bluesy---rare-flat-blue-dial--id21392837.htm>
<GET https://www.chrono24.com/rolex/rolex-daytona--id21716468.htm>
<GET https://www.chrono24.com/rolex/rolex-vintage-26mm-18k-gold-lady-datejust-president-with-box-and-papers--id20765460.htm>
<GET https://www.chrono24.com/rolex/rolex-rolex-228238-day-date-black-baguette-diamond-dial-yellow-gold-40mm--id17173727.htm>
<GET https://www.chrono24.com/rolex/rolex-day-date-36--id21231853.htm>
<GET https://www.chrono24.com/rolex/rolex-daytona-white-dial-two-tone-18kss-boxcardpapers-116503--id21340628.htm>
<GET https://www.chrono24.com/rolex/rolex-daytona-two-tone-white-dial-18k-gold--116503--id21545861.htm>
<GET https://www.chrono24.com/about-us.htm>
<GET https://www.chrono24.com/rolex/rolex-gmt-master-ii-rootbeer-rose-gold-2021--id21947528.htm>
<GET https://www.chrono24.com/rolex/rolex-rolex-red-sea-dweller-43mm-mark-1-50th-anniversary-steel-126600-sd43--id21398965.htm>
<GET https://www.chrono24.com/rolex/rolex-submariner-124060---full-set--id21392883.htm>
<GET https://www.chrono24.com/rolex/rolex-submariner--id21799317.htm>
Upvotes: 1