Reputation: 336
My last question is here: last question
I have done my best to rethink and improve my spider's structure; however, for some reason it still does not start crawling.
I have also checked the XPath expressions and they work (in the Chrome console).
I join the URL with the href because the href only returns the parameter part. A sample link format is attached at my last question (I want to keep this post from getting lengthy).
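For illustration, this is roughly the join I perform (the href value below is made up; the real link format is attached at my last question):

# a sidebar href only carries the parameter part (made-up example)
parameter = 'Main.nsf/h_Toc/12345/?OpenDocument'
# drop the '#{unid=...}' fragment from the current URL, then append the parameter
base = response.url[:response.url.find('#')]
full_url = base + parameter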
My Spider:
class kmssSpider(scrapy.Spider):
    name = 'kmss'
    start_url = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument#{unid=ADE682E34FC59D274825770B0037D278}'
    login_page = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    allowed_domain = ["kmssqkr.hksarg"]

    def start_requests(self):
        yield Request(url=self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        return FormRequest.from_response(response, formdata={'user': 'usename', 'password': 'pw'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        if 'Welcome' in response.body:
            self.log("\n\n\n\n Successfuly Logged in \n\n\n ")
            yield Request(url=self.start_url,
                          cookies={'LtpaToken2': 'jHxHvqs+NeT...'}
                          )
        else:
            self.log("\n\n You are not logged in \n\n ")

    def parse(self, response):
        listattheleft = response.xpath("*//*[@class='qlist']/li[not(contains(@role,'menuitem'))]")
        anyfolder = response.xpath("*//*[@class='q-folderItem']/h4")
        anyfile = response.xpath("*//*[@class='q-otherItem']/h4")

        for each_tab in listattheleft:
            item = CrawlkmssItem()
            item['url'] = each_tab.xpath('a/@href').extract()
            item['title'] = each_tab.xpath('a/text()').extract()
            yield item

            if 'unid' not in each_tab.xpath('./a').extract():
                parameter = each_tab.xpath('a/@href').extract()
                locatetheroom = parameter.find('PageLibrary')
                item['room'] = parameter[locatetheroom:]
                locatethestart = response.url.find('#', 0)
                full_url = response.url[:locatethestart] + parameter
                yield Request(url=full_url,
                              cookies={'LtpaToken2': 'jHxHvqs+NeT...'}
                              )

        for folder in anyfolder:
            folderparameter = folder.xpath('a/@href').extract()
            locatethestart = response.url.find('#', 0)
            folder_url = response.url[:locatethestart] + folderparameter
            yield Request(url=folder_url, callback='parse_folder',
                          cookies={'LtpaToken2': 'jHxHvqs+NeT...'}
                          )

        for File in anyfile:
            fileparameter = File.xpath('a/@href').extract()
            locatethestart = response.url.find('#', 0)
            file_url = response.url[:locatethestart] + fileparameter
            yield Request(url=file_url, callback='parse_file',
                          cookies={'LtpaToken2': 'jHxHvqs+NeT...'}
                          )

    def parse_folder(self, response):
        findfolder = response.xpath("//div[@class='lotusHeader']")
        folderitem = CrawlkmssFolder()
        folderitem['foldername'] = findfolder.xpath('h1/span/span/text()').extract()
        folderitem['url'] = response.url[response.url.find("unid=") + 5:]
        yield folderitem

    def parse_file(self, response):
        findfile = response.xpath("//div[@class='lotusContent']")
        fileitem = CrawlkmssFile()
        fileitem['filename'] = findfile.xpath('a/text()').extract()
        fileitem['title'] = findfile.xpath(".//div[@class='qkrTitle']/span/@title").extract()
        fileitem['author'] = findfile.xpath(".//div[@class='lotusMeta']/span[3]/span/text()").extract()
        yield fileitem
The information I intended to crawl (shown in screenshots): the left-hand side bar and the folder contents.
Log:
c:\Users\~\crawlKMSS>scrapy crawl kmss
2015-07-28 17:54:59 [scrapy] INFO: Scrapy 1.0.1 started (bot: crawlKMSS)
2015-07-28 17:54:59 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-07-28 17:54:59 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'crawlKMSS.spiders', 'SPIDER_MODULES': ['crawlKMSS.spiders'], 'BOT_NAME': 'crawlKMSS'}
2015-07-28 17:54:59 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2015-07-28 17:54:59 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-28 17:54:59 [boto] DEBUG: Retrieving credentials from metadata server.
2015-07-28 17:55:00 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open
response = self._open(req, data)
File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open
'_open', req)
File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2015-07-28 17:55:00 [boto] ERROR: Unable to read instance data, giving up
2015-07-28 17:55:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-28 17:55:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-28 17:55:01 [scrapy] INFO: Enabled item pipelines:
2015-07-28 17:55:01 [scrapy] INFO: Spider opened
2015-07-28 17:55:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-28 17:55:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-28 17:55:05 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login> (referer: None)
2015-07-28 17:55:10 [scrapy] DEBUG: Crawled (200) <POST https://kmssqkr..hksarg/names.nsf?Login> (referer: https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login)
2015-07-28 17:55:10 [kmss] DEBUG:
Successfuly Logged in
2015-07-28 17:55:10 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument#%7Bunid=ADE682E34FC59D274825770B0037D278%7D> (referer: https://kmssqkr.hksarg/names.nsf?Login)
2015-07-28 17:55:10 [scrapy] INFO: Closing spider (finished)
2015-07-28 17:55:10 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1636,
Would appreciate any help!
Upvotes: 0
Views: 503
Reputation: 2594
There is a warning in your log, and your traceback suggests that the error is raised when opening an HTTPConnection:
2015-07-28 17:54:59 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from https://pypi.python.org/pypi/service_identity and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
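If the missing module is the cause, installing it (and making sure pyOpenSSL is recent enough) should clear that warning, e.g.:

pip install service_identity pyopenssl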
Upvotes: 1
Reputation: 31524
I think you are overcomplicating this. Why do the heavy lifting by inheriting from scrapy.Spider when you have CrawlSpider? A Spider is normally used to scrape a list of pages, while a CrawlSpider is used to crawl whole websites.
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.
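A minimal sketch of that rule-based approach (the start URL, the allow pattern, and the parse_item fields below are placeholders, not your real site structure; you would still handle the login as you do now):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class KmssCrawlSpider(CrawlSpider):
    name = 'kmss_crawl'
    allowed_domains = ['kmssqkr.hksarg']
    # placeholder start page
    start_urls = ['https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf']

    rules = (
        # follow every link matching the (hypothetical) pattern, hand each
        # fetched page to parse_item, and keep following links from those pages
        Rule(LinkExtractor(allow=r'OpenDocument'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # extract whatever fields you need from each page the rules reach
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }

Note that a CrawlSpider must not override parse(), because the rule machinery uses it internally.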
Upvotes: 1