Ajmul

Reputation: 37

Scrapy download data from links where certain other condition is fulfilled

I am extracting data from IMDb lists, and it works fine. I provide a link to the page of all lists related to an IMDb title; the code opens every list and extracts the data I want.

import scrapy


class lisTopSpider(scrapy.Spider):
    name = 'ImdbListsSpider'
    allowed_domains = ['imdb.com']
    start_urls = [
        'https://www.imdb.com/lists/tt2218988'
    ]

    # Lists related to the given title
    def parse(self, response):
        # Grab the list link section
        listsLinks = response.xpath('//div[2]/strong')

        for link in listsLinks:
            list_url = response.urljoin(link.xpath('.//a/@href').get())
            yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})

The issue is that I want this code to skip every list that has more than 50 titles and only fetch data from lists with 50 titles or fewer. The problem is that the list link sits in one block of the page, matched by one XPath, while the number of titles sits in a separate block.

So I tried the following.

for link in listsLinks:
    list_url = response.urljoin(link.xpath('.//a/@href').get())
    numOfTitlesString = response.xpath('//div[@class="list_meta"]/text()[1]').get()
    numOfTitles = int(''.join(filter(lambda i: i.isdigit(), numOfTitlesString)))
    print('numOfTitles', numOfTitles)
    if numOfTitles < 51:
        yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})

But it gives me an empty CSV file. When I print numOfTitles inside the for loop, every iteration shows the value from the very first match of that XPath.

Please suggest a solution for this.

Upvotes: 1

Views: 100

Answers (1)

Wim Hermans

Reputation: 2116

As Gallaecio mentioned, this is just an XPath issue. It is expected that you keep getting the same number, because every iteration runs the exact same XPath against the exact same response object. In the code below we select the whole block for each list (instead of only the part that contains the URL), and for every block we extract the URL and the number of titles relative to that block.

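# Select each list's whole preview block, not only the link inside it.
list_blocks = response.xpath('//*[has-class("list-preview")]')
for block in list_blocks:
    # The leading "./" makes both lookups relative to the current block.
    list_url = response.urljoin(block.xpath('./*[@class="list_name"]//@href').get())
    number_of_titles_string = block.xpath('./*[@class="list_meta"]/text()').get()
    number_of_titles = int(''.join(filter(lambda i: i.isdigit(), number_of_titles_string)))
    # Only follow lists with 50 titles or fewer, as in the question.
    if number_of_titles < 51:
        yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})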
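
To see the scoping idea in isolation, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree in place of Scrapy selectors, and the sample markup is made up to mirror the class names above, so treat both as assumptions rather than IMDb's real page. The helper names parse_title_count and small_list_urls are hypothetical. The sketch also pulls the count out with a regular expression and skips blocks where no number is found, since calling int() on an empty or missing string would raise.

```python
import re
import xml.etree.ElementTree as ET

# Made-up markup that only mirrors the class names used above (an assumption,
# not IMDb's actual HTML).
SAMPLE = """
<div>
  <div class="list-preview">
    <div class="list_name"><strong><a href="/list/ls001/">Short list</a></strong></div>
    <div class="list_meta">12 titles</div>
  </div>
  <div class="list-preview">
    <div class="list_name"><strong><a href="/list/ls002/">Long list</a></strong></div>
    <div class="list_meta">250 titles</div>
  </div>
  <div class="list-preview">
    <div class="list_name"><strong><a href="/list/ls003/">No count</a></strong></div>
    <div class="list_meta">no count here</div>
  </div>
</div>
""".strip()


def parse_title_count(text):
    """Return the first integer found in text, or None if there is none."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Drop thousands separators such as "1,234" before converting.
    return int(match.group().replace(",", ""))


def small_list_urls(markup, max_titles=50):
    """Collect hrefs of list blocks whose own title count is <= max_titles."""
    root = ET.fromstring(markup)
    urls = []
    for block in root.findall(".//div[@class='list-preview']"):
        # Both lookups start from `block`, so each block yields its own values.
        link = block.find("./div[@class='list_name']//a")
        meta = block.find("./div[@class='list_meta']")
        count = parse_title_count(meta.text if meta is not None else None)
        if link is not None and count is not None and count <= max_titles:
            urls.append(link.get("href"))
    return urls


if __name__ == "__main__":
    print(small_list_urls(SAMPLE))
```

The point of the design is that the URL and the count are always read from the same block, so they cannot drift apart the way they do when the count is queried from the whole response.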

Upvotes: 1
