Reputation: 564
I am using Scrapy 0.24 to scrape data from a website. However, I am unable to make any requests from my callback method parse_summary.
class ExampleSpider(scrapy.Spider):
    name = "tfrrs"
    allowed_domains = ["example.org"]
    start_urls = (
        'http://www.example.org/results_search.html?page=0&sport=track&title=1&go=1',
    )

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.org/results_search.html?page=0&sport=track&title=1&go=1',]
        pass

    # works without issue
    def parse(self, response):
        races = response.xpath("//table[@width='100%']").xpath(".//a[starts-with(@href, 'http://www.tfrrs.org/results/')]/@href").extract()
        callback = self.parse_trackfieldsummary
        for race in races:
            yield scrapy.Request(race, callback=self.parse_summary)
        pass

    # works without issue
    def parse_summary(self, response):
        baseurl = 'http://www.example.org/results/'
        results = response.xpath("//div[@class='data']").xpath('.//a[@style="margin-left: 20px;"]/@href').extract()
        for result in results:
            print(baseurl+result)  # shows that url is correct every time
            yield scrapy.Request(baseurl+result, callback=self.parse_compiled)

    # is never fired or shown in terminal
    def parse_compiled(self, response):
        print('test')
        results = response.xpath("//table[@style='width: 935px;']")
        print(results)
When I intentionally make the request in parse_summary fail (due to domain errors, etc.), I can see the error in the prompt, but when I use the correct URL, it's as if I am not even calling it. I have also tested the URLs requested in parse_summary by issuing them from the parse method instead, where they work as expected. What could be causing them to not be fired in the parse_summary method while they succeed in the parse method? Thank you for your help in advance.
After making some changes to my Spider, I still have the same result. However, it works if I use an entirely new project, so I am guessing it has to do with my project settings. Here are my project settings (where raceretrieval is the name of my project):
BOT_NAME = 'raceretrieval'
DOWNLOAD_DELAY = 1
CONCURRENT_REQUESTS = 100
SPIDER_MODULES = ['raceretrieval.spiders']
NEWSPIDER_MODULE = 'raceretrieval.spiders'
ITEM_PIPELINES = {
    'raceretrieval.pipelines.RaceValidationPipeline': 1,
    'raceretrieval.pipelines.RaceDistanceValidationPipeline': 2,
    # 'raceretrieval.pipelines.RaceUploadPipeline': 9999
}
If I comment out both DOWNLOAD_DELAY = 1 and CONCURRENT_REQUESTS = 100, the spider works as expected. Why could this be? I don't understand how they would affect this.
Upvotes: 7
Views: 8059
Reputation: 7
These should fix the issue you were having by clearing out stale compiled files (run them from your project directory rather than the filesystem root):

find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
find . -name '*.pyc' -delete
find . -name '*.egg'
Edit:
If that doesn't solve it, then the issue may actually be that the download delay is backlogging the last requests, which will be yielded eventually, just after a very long time ^^
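To get a feel for the scale of that backlog: with DOWNLOAD_DELAY = 1, requests to a single domain are spaced roughly one second apart, so the time to drain the queue grows linearly with its length. A back-of-the-envelope sketch (the request count below is hypothetical, not taken from the question):

```python
# Rough estimate of how long a request backlog takes to drain when every
# request to the same domain is throttled by DOWNLOAD_DELAY.
DOWNLOAD_DELAY = 1       # seconds between requests to one domain
queued_requests = 5000   # hypothetical number of queued parse_summary requests

drain_seconds = queued_requests * DOWNLOAD_DELAY
print(drain_seconds // 60)  # → 83 (minutes until the last request fires)
```

So a crawl that quietly queues a few thousand follow-up requests can look "dead" for over an hour before the late callbacks ever print anything.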
Upvotes: -3
Reputation: 5814
I corrected a few typos and set the allowed domains correctly, and parse_summary seems to work fine. URLs are extracted, and the parse_compiled results are correctly shown in the terminal.
The output contains lines like the following:
2014-12-29 12:19:05+0100 [example] DEBUG: Crawled (200) <GET
http://www.tfrrs.org/results/36288_f.html> (referer:
http://www.tfrrs.org/results/36288.html) <200
http://www.tfrrs.org/results/36288_f.html>
[<Selector xpath="//table[@style='width: 935px;']" data=u'<table width="0" border="0" cellspacing='>, <Selector xpath="//table[@style='width: 935px;']" data=u'<table width="0" border="0" cellspacing='> .....
Here is the corrected code:
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["tfrrs.org"]
    start_urls = (
        'http://www.tfrrs.org/results_search.html?page=0&sport=track&title=1&go=1',
    )

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.tfrrs.org/results_search.html?page=0&sport=track&title=1&go=1',]

    # works without issue
    def parse(self, response):
        races = response.xpath("//table[@width='100%']").xpath(".//a[starts-with(@href, 'http://www.tfrrs.org/results/')]/@href").extract()
        #callback = self.parse_trackfieldsummary
        for race in races:
            yield scrapy.Request(race, callback=self.parse_summary)
        pass

    # works without issue
    def parse_summary(self, response):
        baseurl = 'http://www.tfrrs.org/results/'
        results = response.xpath("//div[@class='data']").xpath('.//a[@style="margin-left: 20px;"]/@href').extract()
        for result in results:
            #print(baseurl+result) # shows that url is correct every time
            yield scrapy.Request(baseurl+result, callback=self.parse_compiled)

    # is never fired or shown in terminal
    def parse_compiled(self, response):
        print(response)
        results = response.xpath("//table[@style='width: 935px;']")
        print(results)
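The reason setting allowed_domains matters here is that Scrapy's offsite middleware silently drops any request whose host does not belong to one of the allowed domains, with no error raised in the callback, which matches the "requests never fire" symptom. A simplified Python 3 sketch of that host check (this is an illustration of the idea, not Scrapy's actual implementation):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Simplified model of the allowed_domains check: a request passes only
    if its host equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ''
    return not any(host == d or host.endswith('.' + d) for d in allowed_domains)

# With allowed_domains = ["example.org"], tfrrs.org requests are dropped:
print(is_offsite('http://www.tfrrs.org/results/36288.html', ['example.org']))  # True (dropped)
print(is_offsite('http://www.tfrrs.org/results/36288.html', ['tfrrs.org']))    # False (allowed)
```

With the mismatched domain, the follow-up requests are filtered before they ever reach the downloader, which is why no error appeared in the terminal.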
Upvotes: 6