MITHU

Reputation: 154

Unable to fetch some links using list comprehension within scrapy

I've written a Python script using Scrapy to collect the links from the response to a POST request to a certain URL. The following script returns the links as expected.

Working one:

import scrapy
from scrapy.crawler import CrawlerProcess

class AftnetSpider(scrapy.Spider):
    name = "aftnet"
    base_url = "http://www.aftnet.be/MyAFT/Clubs/SearchClubs"

    def start_requests(self):
        yield scrapy.FormRequest(self.base_url,callback=self.parse,formdata={'regions':'1,3,4,6'})

    def parse(self,response):
        for items in response.css("dl.club-item"):
            for item in items.css("dd a[data-toggle='popover']::attr('data-url')").getall():
                yield {"result_url":response.urljoin(item)}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(AftnetSpider)
    c.start()

However, I want to achieve the same result using a list comprehension, and that attempt raises an error.

Using list comprehension:

def parse(self,response):
    return [response.urljoin(item) for items in response.css("dl.club-item") for item in items.css("dd a[data-toggle='popover']::attr('data-url')").getall()]

I get the following error:

2019-03-08 12:45:44 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'str' in <POST http://www.aftnet.be/MyAFT/Clubs/SearchClubs>

How can I get the links using a list comprehension within Scrapy?

Upvotes: 0

Views: 149

Answers (1)

BoarGules

Reputation: 16952

Your generator with a loop yields a single dict on each iteration:

yield {"result_url":response.urljoin(item)}

But your list comprehension returns a list of strings, and Scrapy rejects each string because it expects a Request, an item, a dict or None (hence the "got 'str'" in the error). I don't know why you want a list comprehension here: your generator is much easier to understand, as shown by the fact that you got it to work and are having trouble with the list comprehension. If you insist, though, what you need is a list of dicts, not strings, something like:

return [{"result_url":response.urljoin(item)} for items in response.css("dl.club-item") for item in items.css("dd a[data-toggle='popover']::attr('data-url')").getall()]

But please don't do that. Remember that readability counts. Your generator is readable, your one-liner isn't.
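To see the difference between the two return values without running a crawl, here is a minimal sketch in plain Python. It assumes that joining a relative URL against the request URL behaves like the standard library's urljoin; the sample paths and the is_valid_scrapy_output helper are hypothetical stand-ins (the helper is a simplified imitation of the check behind the error message, not Scrapy's actual code).

```python
from urllib.parse import urljoin

BASE_URL = "http://www.aftnet.be/MyAFT/Clubs/SearchClubs"

# Hypothetical relative URLs standing in for the scraped data-url values.
SAMPLE_PATHS = ["/MyAFT/Clubs/Detail/1", "/MyAFT/Clubs/Detail/2"]


def as_strings(paths):
    """Mirror the failing comprehension: a list of plain strings."""
    return [urljoin(BASE_URL, path) for path in paths]


def as_items(paths):
    """Mirror the corrected comprehension: a list of dicts."""
    return [{"result_url": urljoin(BASE_URL, path)} for path in paths]


def is_valid_scrapy_output(obj):
    """Simplified, assumed stand-in for the type check in the error message.

    Only dict and None are covered here; Request and item objects are
    left out to keep the sketch free of third-party imports.
    """
    return obj is None or isinstance(obj, dict)


if __name__ == "__main__":
    for label, results in (("strings", as_strings(SAMPLE_PATHS)),
                           ("dicts", as_items(SAMPLE_PATHS))):
        accepted = all(is_valid_scrapy_output(r) for r in results)
        print(label, "accepted" if accepted else "rejected")
```

Every element of the returned list is checked on its own, so wrapping each URL in a dict is what matters, whether the elements come from a yield in a loop or from a comprehension.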

Upvotes: 1
