Reputation: 28571
I have a scrapy spider (code at this gist) which seems to run fine, apart from the fact that it suddenly stops for no apparent reason. When it stops, the last bit of the log file is:
2012-12-28 23:42:04+0000 [church] DEBUG: Crawled (200) <GET http://www.achurchnearyou.com/cogges-st-mary/> (referer: http://www.achurchnearyou.com/clifton-reynes-st-mary-the-virgin/)
2012-12-28 23:42:04+0000 [church] DEBUG: Scraped from <200 http://www.achurchnearyou.com/cogges-st-mary/>
{'archdeaconry': u'OXFORD',
'archdeaconry_id': u'271',
'benefice': u'Cogges and S Leigh',
'benefice_id': u'27',
'deanery': u'WITNEY',
'deanery_id': u'27109',
'legal_name': u'Cogges',
'parish_id': u'270245'}
2012-12-28 23:42:04+0000 [church] DEBUG: Redirecting (301) to <GET http://www.achurchnearyou.com//> from <GET http://www.achurchnearyou.com/venue.php?V=0083>
2012-12-28 23:42:04+0000 [church] INFO: Closing spider (finished)
Is there any reason that a spider might decide it is finished straight after redirecting a URL? The interesting thing is that I have some custom DownloaderMiddleware which will catch a redirect like this and create a new request instead (basically some URLs that I'm trying will redirect to the homepage, and I want to ignore those and create a different URL instead).
Upvotes: 2
Views: 1985
Reputation: 1508
I looked at your code and it seems clean, but I think the cause is simpler (I still don't know why you started with an initial id of 63).
Working backwards from what you are trying to do, the simplest answer is:
If you go to http://www.achurchnearyou.com/send_message.php?venue_id=82 it works, but if you try http://www.achurchnearyou.com/send_message.php?venue_id=83
(note the id: 82 vs 83)
the name of the parish 'disappears'. The same goes for the other functions.
You are getting the redirect because, instead of returning a 404 (file not found), the CMS/website redirects you to the home page.
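If it helps, here is a minimal sketch of how you could recognise that "missing venue" redirect from the status code and the Location header. It uses only the standard library, not Scrapy, and the function name `is_homepage_redirect` and the set of status codes are my own assumptions, not anything taken from your gist.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: these are the redirect status codes worth checking.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def is_homepage_redirect(request_url, status, location):
    """Return True if a redirect response points at the site's root page.

    Hypothetical helper: request_url is the URL that was fetched, status is
    the HTTP status code, location is the value of the Location header.
    """
    if status not in REDIRECT_STATUSES or not location:
        return False
    # Location may be relative, so resolve it against the requested URL.
    target = urlsplit(urljoin(request_url, location))
    origin = urlsplit(request_url)
    # The log shows the target as ".com//", so strip all slashes
    # before checking whether the path is empty.
    return (
        target.netloc == origin.netloc
        and target.path.strip("/") == ""
        and not target.query
    )
```

With the redirect from your log, `is_homepage_redirect("http://www.achurchnearyou.com/venue.php?V=0083", 301, "http://www.achurchnearyou.com//")` is true, so that id can be treated as a venue that does not exist rather than as a page to scrape.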
Upvotes: 1