Reputation: 1
I want to request the page every once in a while to determine whether its content has been updated, but my own callback function is never triggered. My allowed_domains and request URL are:
allowed_domains = ['www1.hkexnews.hk']
start_urls = 'https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9'
The code for the parsing part is:
# Crawl all data first at each start
def parse(self, response):
    Total_records = int(re.findall("\d+", response.xpath("//div[@class='PD-TotalRecords']/text()").extract()[0])[0])
    dict = {}
    is_Latest = True
    global Latest_info
    global previous_hash
    for i in range(1, Total_records + 1):
        content = response.xpath("//table/tbody/tr[{}]//text()".format(i)).extract()
        # Use the group function to group the list by key
        result = list(group(content, self.keys))
        Time = dict['Time'] = result[0].get(self.keys[0])
        Code = dict['Code'] = result[1].get(self.keys[1])
        dict['Name'] = result[2].get(self.keys[2])
        if is_Latest:
            Latest_info = str(Time) + " | " + str(Code)
            is_Latest = False
        yield dict
    previous_hash = get_hash(Latest_info.encode('utf-8'))
    # Monitor data updates and crawl for new data
    while True:
        time.sleep(10)
        # Request website content and calculate hash values
        yield scrapy.Request(url=self.start_urls, callback=self.parse_check, dont_filter=True)
My callback function is:
def parse_check(self, response):
    global previous_hash
    global Latest_info
    dict = {}
    content = response.xpath("//table/tbody/tr[1]//text()").extract()
    # Use the group function to group the list by key
    result = list(group(content, self.keys))
    Time = result[0].get(self.keys[0])
    Code = result[1].get(self.keys[1])
    current_info = str(Time) + " | " + str(Code)
    current_hash = get_hash(current_info.encode('utf-8'))
    # Compare hash values to determine if website content is updated
    if current_hash != previous_hash:
        dict['Time'] = Time
        dict['Code'] = Code
        dict['Name'] = result[2].get(self.keys[2])
        previous_hash = current_hash
        Latest_info = current_info
        yield dict
I tried logging from an errback, but nothing was printed. After that I tried requesting the page with requests.get instead of yielding a scrapy.Request, and that worked, but I still don't know why my callback function is never called.
Upvotes: 0
Views: 56
Reputation: 1
I found out why, or at least this fix works for me: avoid time.sleep in Scrapy. It blocks the Twisted reactor (the asynchronous framework underlying Scrapy), which stalls the entire spider and disables all of Scrapy's concurrency, so the queued request is never sent and its callback never runs. Use the DOWNLOAD_DELAY setting or the AutoThrottle extension instead.
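Here is a minimal sketch of that pattern, assuming the spider polls the URL from the question roughly every 10 seconds (the extraction and hash-comparison logic is elided): instead of sleeping in a while loop, each callback yields the next request, and the scheduler applies DOWNLOAD_DELAY before sending it, so the reactor is never blocked.

import scrapy

class MonitorSpider(scrapy.Spider):
    name = "monitor"
    allowed_domains = ['www1.hkexnews.hk']
    start_urls = ['https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9']

    # Wait ~10 seconds between requests to the same site instead of
    # calling time.sleep; the reactor keeps running during the delay.
    custom_settings = {
        'DOWNLOAD_DELAY': 10,
    }

    def parse(self, response):
        # ... extract and yield the initial items here ...
        # Replace `while True: time.sleep(10)` with a single request;
        # the scheduler delays it by DOWNLOAD_DELAY before sending it.
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse_check, dont_filter=True)

    def parse_check(self, response):
        # ... compare hashes and yield an item if the page changed ...
        # Re-arm the check by yielding the next request, again without
        # blocking the reactor.
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse_check, dont_filter=True)

Alternatively, setting AUTOTHROTTLE_ENABLED = True in custom_settings lets the AutoThrottle extension adjust the delay dynamically based on the site's response times.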
Upvotes: 0