twitu

Reputation: 625

Scrape with Scrapy using URLs taken from a list

import scrapy

class PractiseSpider(scrapy.Spider):
    name = "practise"
    allowed_domains = ["practise.com"]
    start_urls = ['https://practise.com/product/{}/']

    def parse(self, response):
        # do something with the response,
        # then scrape the next url in the list
        pass

My list m contains the IDs that need to be substituted into the URL, like 'product/{}/'.format(m[i]), iteratively. How do I do this? Should I make a new spider call for each URL, or should I write some code so the spider automatically iterates over the list? If the latter, what do I write?

I know there are many answers related to this, for example this one, but I have a fixed and known list of URLs.

Upvotes: 1

Views: 1173

Answers (2)

Granitosaurus

Reputation: 21406

As an alternative to overriding start_urls, you can override the start_requests() method of your spider. This method yields the requests that start off your spider.

By default your spider does this:

def start_requests(self):
    for url in self.start_urls:
        # dont_filter=True skips the duplicate filter for start urls
        yield Request(url, dont_filter=True)

so you can modify this method in your spider to do anything you want:

def start_requests(self):
    # pop_ids_from_db() is a placeholder for however you obtain your ids
    ids = pop_ids_from_db()
    for id in ids:
        url = f'http://example.com/product/{id}'
        yield Request(url, dont_filter=True)
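
Applied to the question's fixed list m, a minimal sketch might look like the following (the ID values are placeholders, assuming m holds product IDs):

import scrapy
from scrapy import Request

class PractiseSpider(scrapy.Spider):
    name = "practise"
    allowed_domains = ["practise.com"]
    m = ["id1", "id2", "id3"]  # placeholder for the known list of product ids

    def start_requests(self):
        # yield one request per product id; Scrapy schedules them all
        for product in self.m:
            url = 'https://practise.com/product/{}/'.format(product)
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # do something with each product page
        pass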

Upvotes: 2

Tomáš Linhart

Reputation: 10210

If you know the URLs beforehand, just populate start_urls. Assuming m is a list of product IDs (that's what I take from what you wrote), it would look like this:

start_urls = ['https://practise.com/product/{}/'.format(product) for product in m]
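
In context, a minimal sketch of the whole spider (assuming m is defined before the class, e.g. hard-coded or loaded from a file) could be:

import scrapy

m = ["id1", "id2", "id3"]  # placeholder for the known list of product ids

class PractiseSpider(scrapy.Spider):
    name = "practise"
    allowed_domains = ["practise.com"]
    # one start url per product; Scrapy requests each and calls parse()
    start_urls = ['https://practise.com/product/{}/'.format(product) for product in m]

    def parse(self, response):
        # do something with each product page
        pass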

Upvotes: 3
