Reputation: 127
I have a Scrapy spider that I've generated. Its purpose is to return network data for graphing the network and also to save the HTML file for each page the spider reaches. The spider achieves the first goal but not the second: it produces a CSV file with the crawl information, but I can't see that it is saving the HTML files.
# -*- coding: utf-8 -*-
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.utils.url import urljoin_rfc

from sitegraph.items import SitegraphItem


class CrawlSpider(CrawlSpider):
    name = "example"
    custom_settings = {
        'DEPTH_LIMIT': '1',
    }
    allowed_domains = []
    start_urls = (
        'http://exampleurl.com',
    )

    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = SitegraphItem()
        i['url'] = response.url
        # i['http_status'] = response.status
        llinks = []
        for anchor in hxs.select('//a[@href]'):
            href = anchor.select('@href').extract()[0]
            if not href.lower().startswith("javascript"):
                llinks.append(urljoin_rfc(response.url, href))
        i['linkedurls'] = llinks
        return i

    def parse(self, response):
        filename = response.url.split("/")[-1] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
The traceback I receive is as follows:
Traceback (most recent call last):
File "...\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://externalurl.com/> (failed 3 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.core.scraper] ERROR: Error downloading <GET http://externalurl.com/>
Traceback (most recent call last):
File "...\Anaconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..
2019-07-23 14:16:41 [scrapy.core.engine] INFO: Closing spider (finished)
2019-07-23 14:16:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (153 items) in: exampledomainlevel1.csv
Upvotes: 1
Views: 1017
Reputation: 3561
parse method:
According to the Scrapy docs and another Stack Overflow question, it is not recommended to override the parse method, because CrawlSpider uses it to implement its logic.
If you need to override the parse method and at the same time keep the original behaviour of CrawlSpider.parse, you need to add its original source back into your parse method:
def parse(self, response):
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)
    return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
csv feed:
This log line:
2019-07-23 14:16:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (153 items) in: exampledomainlevel1.csv
means that the csv feed exporter is enabled (probably in the settings.py project settings file).
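If you did not configure it in settings.py yourself, the same feed can also come from the -o command-line option, e.g. scrapy crawl example -o exampledomainlevel1.csv. A minimal sketch of the settings-based version (only the output file name is taken from your log line; the rest is an assumption about how your project is configured):

# settings.py -- hypothetical feed-export configuration
FEED_FORMAT = 'csv'                       # pre-2.1 Scrapy style
FEED_URI = 'exampledomainlevel1.csv'

# On Scrapy 2.1+ the equivalent would be:
# FEEDS = {
#     'exampledomainlevel1.csv': {'format': 'csv'},
# }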
UPDATE
I looked at the CrawlSpider source code again. It turns out the parse method is called only once, at the beginning of the crawl, so it doesn't cover all web responses.
If my theory is correct, adding this function to your spider class should save all HTML responses:
def _response_downloaded(self, response):
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)

    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
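If you would rather not override _response_downloaded (it is a private CrawlSpider method, so its behaviour may change between Scrapy versions), another option is to write the file inside your existing Rule callback, since parse_item already receives every response the link extractor matches. A minimal sketch, reusing the file-naming scheme from your spider:

def parse_item(self, response):
    # Alternative sketch: save the HTML here instead of in CrawlSpider internals.
    # Note that URLs ending in "/" produce an empty name with this scheme, so a
    # more robust filename strategy may be needed.
    filename = response.url.split("/")[-1] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)

    i = SitegraphItem()
    i['url'] = response.url
    # ... build i['linkedurls'] exactly as in your original parse_item ...
    return i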
Upvotes: 1