
Reputation: 455

No error, just DEBUG: Crawled (200) (referer: None)

I was trying to scrape some data from a Korean web page, but nothing is scraped at all, even though the XPath query works fine in the browser's element filter. Here is my Python snippet. Thank you for your help. P.S. The snippet has been edited according to the advice from @Alexander.

import scrapy


class CoursesSpider(scrapy.Spider):
    name = 'courses'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        # Each <li> under the course section is one course entry.
        for course in response.xpath("//section[@id='course']//ul/li"):
            yield {
                'title': course.xpath("./h2/text()").get(),
                'hours': course.xpath("./div/strong/text()").get(),
                'content': course.xpath("./div/p/text()").get()
            }

The debug log is:

   2022-12-09 20:15:18 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: codealive)
   2022-12-09 20:15:18 [scrapy.utils.log] INFO: Versions: lxml, libxml2 2.9.12, cssselect 1.2.0, parsel 1.7.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.7.15 (default, Nov 24 2022, 12:02:37) - [Clang 14.0.6 ], pyOpenSSL 22.0.0 (OpenSSL 1.1.1s 1 Nov 2022), cryptography 38.0.2, Platform Darwin-22.1.0-x86_64-i386-64bit
   2022-12-09 20:15:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'codealive', 'NEWSPIDER_MODULE': 'codealive.spiders', 'SPIDER_MODULES': ['codealive.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.372'}
   2022-12-09 20:15:18 [scrapy.extensions.telnet] INFO: Telnet Password: 35b6e238174899c0
   2022-12-09 20:15:18 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
   2022-12-09 20:15:18 [scrapy.middleware] INFO: Enabled downloader middlewares: 'scrapy.downloadermiddlewares.stats.DownloaderStats']
   2022-12-09 20:15:18 [scrapy.middleware] INFO: Enabled spider middlewares: 'scrapy.spidermiddlewares.depth.DepthMiddleware']
   2022-12-09 20:15:18 [scrapy.middleware] INFO: Enabled item pipelines: []
   2022-12-09 20:15:18 [scrapy.core.engine] INFO: Spider opened
   2022-12-09 20:15:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
   2022-12-09 20:15:18 [scrapy.extensions.telnet] INFO: Telnet console listening on
   2022-12-09 20:15:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET> (referer: None)
   2022-12-09 20:15:18 [scrapy.core.engine] INFO: Closing spider (finished)
   2022-12-09 20:15:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
   {'downloader/request_bytes': 280,
    'downloader/request_count': 1,
    'downloader/request_method_count/GET': 1,
    'downloader/response_bytes': 9694,
    'downloader/response_count': 1,
    'downloader/response_status_count/200': 1,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2022, 12, 9, 11, 15, 18, 903893),
    'log_count/DEBUG': 1,
    'log_count/INFO': 9,
    'memusage/max': 58916864,
    'memusage/startup': 58916864,
    'response_received_count': 1,
    'scheduler/dequeued': 1,
    'scheduler/dequeued/memory': 1,
    'scheduler/enqueued': 1,
    'scheduler/enqueued/memory': 1,
    'start_time': datetime.datetime(2022, 12, 9, 11, 15, 18, 730596)}
   2022-12-09 20:15:18 [scrapy.core.engine] INFO: Spider closed (finished)

Upvotes: 1

Views: 156

Answers (1)


Reputation: 17365

You were close. I think the main issue is that you have a parse function defined inside your parse function, although I am not certain that isn't just a typo from copying and pasting your code.
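To show why a nested parse would produce no items and no error, here is a minimal sketch in plain Python. It assumes the pre-edit spider looked roughly like `broken_parse` below; that code is not shown in the question, so the shape is a guess. `FakeResponse`, `broken_parse`, and `fixed_parse` are hypothetical names used only so the sketch runs without Scrapy.

```python
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response; holds pre-extracted rows."""

    def __init__(self, rows):
        self.rows = rows


def broken_parse(response):
    # The inner function is defined but never called, so the outer
    # function contains no yield of its own and simply returns None.
    def parse(response):
        for row in response.rows:
            yield {"title": row}


def fixed_parse(response):
    # Yielding directly from the callback makes it a generator of items.
    for row in response.rows:
        yield {"title": row}


response = FakeResponse(["Power Base", "Core Algorithm"])
print(broken_parse(response))       # None
print(list(fixed_parse(response)))  # two dicts, one per row
```

A callback that returns None gives the framework nothing to collect, which would match the question's log: a 200 response, zero items scraped, and no error.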

A couple of other points: whenever an element has an @id attribute, you should take advantage of it, since ids are typically unique.

It's also better for readability to keep the XPath selectors as simple as possible.

For example:

import scrapy


class CoursesSpider(scrapy.Spider):
    name = 'courses'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        # Anchor on the section's id, then use short relative paths per <li>.
        for course in response.xpath("//section[@id='course']//ul/li"):
            yield {
                'title': course.xpath("./h2/text()").get(),
                'hours': course.xpath("./div/strong/text()").get(),
                'content': course.xpath("./div/p/text()").get()
            }

Output:

2022-12-09 05:08:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-09 05:08:40 [scrapy.extensions.telnet] INFO: Telnet console listening on
2022-12-09 05:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET> (referer: None)
2022-12-09 05:08:41 [scrapy.core.scraper] DEBUG: Scraped from <200>
{'title': 'Power Base', 'hours': '24 Lessons x 100min', 'content': 'Python의 기본문법'}
2022-12-09 05:08:41 [scrapy.core.scraper] DEBUG: Scraped from <200>
{'title': 'Core Algorithm', 'hours': '24 Lessons x 100min', 'content': '스택(stack), 큐(queue)등'}
2022-12-09 05:08:41 [scrapy.core.scraper] DEBUG: Scraped from <200>
{'title': 'Super AI', 'hours': '24 Lessons x 100min', 'content': '머신러닝 에이전트를 이용한'}
2022-12-09 05:08:41 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-09 05:08:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 296,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 9694,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.808256,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 12, 9, 13, 8, 41, 105546),
 'httpcompression/response_bytes': 32492,
 'httpcompression/response_count': 1,
 'item_scraped_count': 3,
 'log_count/DEBUG': 5,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 12, 9, 13, 8, 40, 297290)}
2022-12-09 05:08:41 [scrapy.core.engine] INFO: Spider closed (finished)
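
The same idea (anchor on the id, then use short relative paths) can be checked offline against a static fragment. The sketch below uses the standard library's `xml.etree.ElementTree` rather than Scrapy, so it is not the spider's own selector engine: ElementTree supports only a subset of XPath and needs well-formed markup. The `SAMPLE` fragment and the `extract_courses` helper are hypothetical; the real page's markup is assumed to have this shape, based on the selectors above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped the way the selectors assume.
SAMPLE = """
<html><body>
  <section id="course">
    <ul>
      <li><h2>Power Base</h2><div><strong>24 Lessons x 100min</strong><p>Intro</p></div></li>
      <li><h2>Core Algorithm</h2><div><strong>24 Lessons x 100min</strong><p>Stacks</p></div></li>
    </ul>
  </section>
</body></html>
"""


def extract_courses(markup):
    """Return one dict per <li> under the section with id='course'."""
    root = ET.fromstring(markup)
    items = []
    # Anchor on the id attribute, then walk down to the list items.
    for li in root.findall(".//section[@id='course']//ul/li"):
        items.append({
            # Relative paths keep each field lookup short and readable.
            "title": li.findtext("./h2"),
            "hours": li.findtext("./div/strong"),
            "content": li.findtext("./div/p"),
        })
    return items


for item in extract_courses(SAMPLE):
    print(item)
```

If a field comes back as None here, the relative path does not match the fragment, which is the same symptom to look for in the spider's output.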

Upvotes: 1
