Reputation: 1
I want to request the page every once in a while to determine whether its content has been updated, but my own callback function is never triggered. My allowed_domains and request URL are:
allowed_domains = ['www1.hkexnews.hk']
start_urls = 'https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9'
The code for the parsing part is:
# Crawl all data first at each start
def parse(self, response):
    Total_records = int(re.findall("\d+", response.xpath("//div[@class='PD-TotalRecords']/text()").extract()[0])[0])
    dict = {}
    is_Latest = True
    global Latest_info
    global previous_hash
    for i in range(1, Total_records + 1):
        content = response.xpath("//table/tbody/tr[{}]//text()".format(i)).extract()
        # Use the group function to group the list by key
        result = list(group(content, self.keys))
        Time = dict['Time'] = result[0].get(self.keys[0])
        Code = dict['Code'] = result[1].get(self.keys[1])
        dict['Name'] = result[2].get(self.keys[2])
        if is_Latest:
            Latest_info = str(Time) + " | " + str(Code)
            is_Latest = False
        yield dict
    previous_hash = get_hash(Latest_info.encode('utf-8'))
    # Monitor data updates and crawl for new data
    while True:
        time.sleep(10)
        # Request website content and calculate hash values
        yield scrapy.Request(url=self.start_urls, callback=self.parse_check, dont_filter=True)
My callback function is:
def parse_check(self, response):
    global previous_hash
    global Latest_info
    dict = {}
    content = response.xpath("//table/tbody/tr[1]//text()").extract()
    # Use the group function to group the list by key
    result = list(group(content, self.keys))
    Time = result[0].get(self.keys[0])
    Code = result[1].get(self.keys[1])
    current_info = str(Time) + " | " + str(Code)
    current_hash = get_hash(current_info.encode('utf-8'))
    # Compare hash values to determine if website content is updated
    if current_hash != previous_hash:
        dict['Time'] = Time
        dict['Code'] = Code
        dict['Name'] = result[2].get(self.keys[2])
        previous_hash = current_hash
        Latest_info = current_info
        yield dict
I tried logging from an errback, but nothing was printed. After that I tried requesting the page with requests.get instead of yielding a scrapy.Request, and that worked, but I still don't know why my callback function is never called.
Upvotes: 0
Views: 56
Reputation: 1
I found out why, or at least this fix works for me: avoid time.sleep in Scrapy. It blocks the Twisted reactor (the asynchronous framework underlying Scrapy), which stalls the entire spider and disables all of Scrapy's concurrency, so the queued request is never sent and its callback never runs. Use the DOWNLOAD_DELAY setting or the AutoThrottle extension instead.
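Here is a minimal sketch of that pattern, assuming the spider polls the URL from the question roughly every 10 seconds (the extraction and hash-comparison logic is elided): instead of sleeping in a while loop, each callback yields the next request, and the scheduler applies DOWNLOAD_DELAY before sending it, so the reactor is never blocked.

import scrapy

class MonitorSpider(scrapy.Spider):
    name = "monitor"
    allowed_domains = ['www1.hkexnews.hk']
    start_urls = ['https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9']

    # Wait ~10 seconds between requests to the same site instead of
    # calling time.sleep; the reactor keeps running during the delay.
    custom_settings = {
        'DOWNLOAD_DELAY': 10,
    }

    def parse(self, response):
        # ... extract and yield the initial items here ...
        # Replace `while True: time.sleep(10)` with a single request;
        # the scheduler delays it by DOWNLOAD_DELAY before sending it.
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse_check, dont_filter=True)

    def parse_check(self, response):
        # ... compare hashes and yield an item if the page changed ...
        # Re-arm the check by yielding the next request, again without
        # blocking the reactor.
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse_check, dont_filter=True)

Alternatively, setting AUTOTHROTTLE_ENABLED = True in custom_settings lets the AutoThrottle extension adjust the delay dynamically based on the site's response times.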
Upvotes: 0