redeyes

Reputation: 61

Scrape multiple URLs with Scrapy

How can I scrape multiple URLs with Scrapy?

Am I forced to make multiple crawlers?

class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4),"http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = TravelItem()
        item['url'] = hxs.select('//a[@class="out"]/@href').extract()
        out = "\n".join(str(e) for e in item['url']);
        print out

Python says:

NameError: name 'i' is not defined

But when I use one URL it works fine!

start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)"]

Upvotes: 2

Views: 5889

Answers (3)

MJ_0826

Reputation: 41

Python has only four scopes: LEGB. The local scope of the class body and the local scope of the list comprehension are not nested functions, so neither forms an enclosing scope for the other. They are two separate local scopes that cannot access each other's names.

So don't reference class-level variables from a comprehension written directly in the class body.
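A minimal sketch of that scoping point (this assumes Python 3, where list comprehensions run in their own scope; the names Demo, count and urls are made up for illustration):

class Demo:
    count = 4
    # The iterable of the outermost "for" is evaluated in the class body,
    # so range(count) would be fine, but the comprehension body runs in its
    # own local scope, which does not see class-body names:
    urls = ["page-%d-of-%d" % (i, count) for i in range(4)]
    # NameError: name 'count' is not defined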

Upvotes: 0

Shane Evans

Reputation: 2254

Your Python syntax is incorrect; try:

start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)] + \
    ["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]

If you need to write code to generate start requests, you can define a start_requests() method instead of using start_urls.
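For instance, a minimal sketch of that approach, reusing the URLs from the question and the old-style BaseSpider imports used elsewhere in this thread:

from scrapy.http import Request
from scrapy.spider import BaseSpider


class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]

    def start_requests(self):
        # Yield requests lazily instead of building the whole list in start_urls;
        # each request uses the default callback, self.parse.
        for i in xrange(4):
            yield Request("http://example.com/category/top/page-%d/" % i)
        for i in xrange(55):
            yield Request("http://example.com/superurl/top/page-%d/" % i)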

Upvotes: 3

alecxe

Reputation: 473763

You can initialize start_urls in the spider's __init__() method:

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class TravelItem(Item):
    url = Field()


class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]

    def __init__(self, name=None, **kwargs):
        # Build the start URL list at instantiation time instead of in the class body
        self.start_urls = []
        self.start_urls.extend(["http://example.com/category/top/page-%d/" % i for i in xrange(4)])
        self.start_urls.extend(["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)])

        super(TravelSpider, self).__init__(name, **kwargs)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = TravelItem()
        item['url'] = hxs.select('//a[@class="out"]/@href').extract()
        out = "\n".join(str(e) for e in item['url'])
        print out

Hope that helps.

Upvotes: 3
