yanis

Reputation: 1

Scrapy requests - My own callback function is not being called

I want to request the page periodically to determine whether its content has been updated, but my own callback function isn't being triggered. My allowed_domains and request URL are:

allowed_domains = ['www1.hkexnews.hk']
start_urls = 'https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9'

The code for the parsing part is

    # Crawl all data first at each start
    def parse(self, response):
        Total_records = int(re.findall("\d+",response.xpath("//div[@class='PD-TotalRecords']/text()").extract()[0])[0])
        dict = {}
        is_Latest = True
        global Latest_info
        global previous_hash

        for i in range(1, Total_records + 1):
            content = response.xpath("//table/tbody/tr[{}]//text()".format(i)).extract()

            # Use the group function to group the list by key
            result = list(group(content, self.keys))
            Time = dict['Time'] = result[0].get(self.keys[0])
            Code = dict['Code'] = result[1].get(self.keys[1])
            dict['Name'] = result[2].get(self.keys[2])
            if is_Latest:
                Latest_info = str(Time) + " | " + str(Code)
                is_Latest = False

            yield dict

        previous_hash = get_hash(Latest_info.encode('utf-8'))
        #Monitor data updates and crawl for new data
        while True:
            time.sleep(10)
            # Request website content and calculate hash values
            yield scrapy.Request(url=self.start_urls, callback=self.parse_check, dont_filter=True)

My own callback function is

    def parse_check(self, response):
        global previous_hash
        global Latest_info
        dict = {}
        content = response.xpath("//table/tbody/tr[1]//text()").extract()
        # Use the group function to group the list by key
        result = list(group(content, self.keys))
        Time =  result[0].get(self.keys[0])
        Code = result[1].get(self.keys[1])

        current_info = str(Time) + " | " + str(Code)
        current_hash = get_hash(current_info.encode('utf-8'))

        # Compare hash values to determine if website content is updated
        if current_hash != previous_hash:

            dict['Time'] = Time
            dict['Code'] = Code
            dict['Name'] = result[2].get(self.keys[2])

            previous_hash = current_hash
            Latest_info = current_info
        yield dict

I tried logging via errback, but nothing was output. I then tried requesting the page with requests.get instead of yielding a scrapy.Request, and that worked, but I still don't understand why my callback function isn't being called.

Upvotes: 0

Views: 56

Answers (1)

yanis

Reputation: 1

I found the cause, or at least this fix works for me: avoid using time.sleep in Scrapy. It blocks the Twisted reactor (the framework underlying Scrapy), which freezes the spider entirely and disables all of Scrapy's concurrency features. Use the DOWNLOAD_DELAY setting or the AutoThrottle extension instead.
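As a minimal sketch of what this looks like in practice (spider and setting names here are illustrative, not taken from the question): instead of a `while True` / `time.sleep(10)` loop, let Scrapy pace the requests via DOWNLOAD_DELAY (or AutoThrottle) and have the callback re-schedule itself with `dont_filter=True`:

```python
import scrapy

class MonitorSpider(scrapy.Spider):
    # Hypothetical spider name for illustration
    name = "monitor"
    allowed_domains = ['www1.hkexnews.hk']
    start_urls = ['https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9']

    custom_settings = {
        # Scrapy waits between downloads instead of blocking the reactor
        'DOWNLOAD_DELAY': 10,
        # Alternatively, let AutoThrottle adapt the delay automatically:
        # 'AUTOTHROTTLE_ENABLED': True,
        # 'AUTOTHROTTLE_START_DELAY': 10,
    }

    def parse_check(self, response):
        # ... compare hashes and yield the item if the content changed ...

        # Re-schedule the next check. dont_filter=True lets the repeated
        # URL through Scrapy's duplicate-request filter; the delay between
        # checks comes from DOWNLOAD_DELAY, not time.sleep.
        yield scrapy.Request(url=response.url,
                             callback=self.parse_check,
                             dont_filter=True)
```

This keeps the reactor free to process responses concurrently, which is why the callback fires where the blocking loop never let it run.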

Upvotes: 0
