Scrapy spider finishing early for no apparent reason

Question

I have a Scrapy spider (code at this gist) that seems to run fine, except that it suddenly stops for no apparent reason. When it stops, the last part of the log file is:

2012-12-28 23:42:04+0000 [church] DEBUG: Crawled (200)  (referer: http://www.achurchnearyou.com/clifton-reynes-st-mary-the-virgin/)
2012-12-28 23:42:04+0000 [church] DEBUG: Scraped from <200 http://www.achurchnearyou.com/cogges-st-mary/>
    {'archdeaconry': u'OXFORD',
     'archdeaconry_id': u'271',
     'benefice': u'Cogges and S Leigh',
     'benefice_id': u'27',
     'deanery': u'WITNEY',
     'deanery_id': u'27109',
     'legal_name': u'Cogges',
     'parish_id': u'270245'}
2012-12-28 23:42:04+0000 [church] DEBUG: Redirecting (301) to  from 
2012-12-28 23:42:04+0000 [church] INFO: Closing spider (finished)

Is there any reason a spider might decide it is finished straight after a redirect? The interesting thing is that I have a custom downloader middleware that catches redirects like this one and creates a new request instead: some of the URLs I try redirect to the homepage, and I want to ignore those and request a different URL.

Carlos Henrique Cano · Accepted Answer

Well..

I looked at your code (it seems clean), but I think the problem is simpler than that (I still don't know why you started with the initial id = 63).

Reverse-engineering your task, the simplest answer is:

The 'parish' with id 83 does not exist or has an error.

If you go to http://www.achurchnearyou.com/send_message.php?venue_id=82, it works. But if you try http://www.achurchnearyou.com/send_message.php?venue_id=83 (note the id 82 vs. 83), the name of the parish 'disappears', and the same happens with other functions.

The reason you are getting a redirect is that, instead of returning a 404 (not found), the CMS/website redirects you to the home page.
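To make that concrete, here is a minimal sketch of how a "redirected to the home page" response could be recognised and the next id tried instead. It uses only the standard library, so it is not the code from your gist. The helper names (is_home_redirect, next_venue_url) and the assumption that the home page lives at the site root are mine, purely for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Assumption: the site's home page is served from the root path.
HOME_PATHS = {"", "/"}
REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_home_redirect(status, location):
    """Return True for a redirect whose Location header points at the site root."""
    if status not in REDIRECT_CODES or not location:
        return False
    parts = urlsplit(location)
    return parts.path in HOME_PATHS and not parts.query


def next_venue_url(url, param="venue_id"):
    """Return the same URL with the numeric id parameter incremented by one."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query[param] = [str(int(query[param][0]) + 1)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

In a downloader middleware, process_response could call is_home_redirect with the response status and Location header and, when it returns True, return a new request for next_venue_url(request.url). Whether such a replacement request survives the duplicate filter depends on how it is built, so that is worth checking in your own setup.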
