robintw

Reputation: 28571

Scrapy spider finishing early for no apparent reason

I have a Scrapy spider (code at this gist) that seems to run fine, except that it suddenly stops for no apparent reason. When it stops, the last part of the log file is:

2012-12-28 23:42:04+0000 [church] DEBUG: Crawled (200) <GET http://www.achurchnearyou.com/cogges-st-mary/> (referer: http://www.achurchnearyou.com/clifton-reynes-st-mary-the-virgin/)
2012-12-28 23:42:04+0000 [church] DEBUG: Scraped from <200 http://www.achurchnearyou.com/cogges-st-mary/>
    {'archdeaconry': u'OXFORD',
     'archdeaconry_id': u'271',
     'benefice': u'Cogges and S Leigh',
     'benefice_id': u'27',
     'deanery': u'WITNEY',
     'deanery_id': u'27109',
     'legal_name': u'Cogges',
     'parish_id': u'270245'}
2012-12-28 23:42:04+0000 [church] DEBUG: Redirecting (301) to <GET http://www.achurchnearyou.com//> from <GET http://www.achurchnearyou.com/venue.php?V=0083>
2012-12-28 23:42:04+0000 [church] INFO: Closing spider (finished)

Is there any reason a spider might decide it is finished straight after a URL redirects? The interesting thing is that I have a custom DownloaderMiddleware that catches a redirect like this and creates a new request instead: some of the URLs I'm trying redirect to the homepage, and I want to ignore those and request a different URL instead.
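To make the "create a different URL instead" step concrete, here is a minimal standard-library sketch. It assumes the replacement URL is simply the next venue id, with the same zero padding as the `V=0083` seen in the log; that rule and the helper name `next_venue_url` are assumptions for illustration, not what the gist's middleware actually does.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_venue_url(url):
    """Return the same URL with the numeric V parameter incremented by one.

    Hypothetical helper: assumes the "different URL" is just the next id.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    current = query["V"][0]
    # Keep the original zero padding, e.g. "0083" -> "0084".
    bumped = str(int(current) + 1).zfill(len(current))
    query["V"] = [bumped]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


print(next_venue_url("http://www.achurchnearyou.com/venue.php?V=0083"))
```

A middleware could call a helper like this when it sees a redirect to the homepage and return a new request for the resulting URL.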

Upvotes: 2

Views: 1985

Answers (1)

Carlos Henrique Cano

Reputation: 1508

Well...

I looked at your code (it seems clean), but I think the error is simpler than that. (I still don't know why you started with an initial id of 63.)

Reverse engineering your task, the simplest answer is:

  1. The 'parish' with id 83 does not exist or has an error.

If you go to http://www.achurchnearyou.com/send_message.php?venue_id=82 it works, but if you try http://www.achurchnearyou.com/send_message.php?venue_id=83 (note the id 82 vs 83), the name of the parish disappears. The same happens with the site's other functions.

The reason you are getting the redirect is that, instead of returning a 404 (file not found), the CMS/website redirects you to the home page.
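A redirect to the home page in place of a 404 can be detected from the status code and the Location header alone. The following is a rough standard-library sketch, not Scrapy code; the function name `is_homepage_redirect` and the rule "an empty path with no query string on the same host counts as the home page" are assumptions for illustration.

```python
from urllib.parse import urljoin, urlsplit

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def is_homepage_redirect(status, location, request_url):
    """Return True if a redirect response points at the site's root page.

    Hypothetical helper: "/" and "//" are both treated as the home page.
    """
    if status not in REDIRECT_STATUSES or not location:
        return False
    # Resolve a relative Location header against the requested URL.
    target = urlsplit(urljoin(request_url, location))
    origin = urlsplit(request_url)
    same_host = target.netloc == origin.netloc
    root_path = target.path.strip("/") == ""
    return same_host and root_path and not target.query


requested = "http://www.achurchnearyou.com/venue.php?V=0083"
print(is_homepage_redirect(301, "http://www.achurchnearyou.com//", requested))
```

With the values from the log above (a 301 from `venue.php?V=0083` to `http://www.achurchnearyou.com//`), this check treats the response as a missing venue rather than a real page.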

Upvotes: 1
