wdetac

Reputation: 2832

Scrapy - Scrape multiple URLs using results from the first URL

  1. I use Scrapy to scrape data from the first URL.
  2. The first URL returns a response that contains a list of URLs.

So far, so good. My question is: how can I scrape each URL in this list as well? From searching, I know I can return a request from parse, but it seems that can only process one URL.

This is my parse method:

def parse(self, response):
    # Get the list of URLs, for example:
    urls = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(urls[0])
    # It works, but how can I continue with b.com and c.com?

Can I do something like this?

def parse(self, response):
    # Get the list of URLs, for example:
    urls = ["http://a.com", "http://b.com", "http://c.com"]

    for link in urls:
        scrapy.Request(link)
        # This is wrong, but I need something like this

Full version:

import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Get the list of URLs, for example:
        urls = ["http://a.com", "http://b.com", "http://c.com"]

        for link in urls:
            scrapy.Request(link)
            # This is wrong, but I need something like this

Upvotes: 2

Views: 9852

Answers (3)

Samyak Jain

Reputation: 1

# within your parse method:

url_list = response.xpath('//a/@href').extract()
print(url_list)  # inspect the list of URLs
for url in url_list:
    # urljoin resolves relative hrefs; scrapy.Request needs an absolute URL
    yield scrapy.Request(response.urljoin(url), callback=self.parse)

This should work.
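
If you're on Scrapy 1.4 or newer, response.follow is a convenient alternative, since it resolves relative URLs and builds the Request for you (a minimal sketch of the same loop):

for url in response.xpath('//a/@href').extract():
    # response.follow accepts relative URLs directly
    yield response.follow(url, callback=self.parse)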

Upvotes: 0

Frank Martin

Reputation: 2594

I think what you're looking for is the yield statement:

def parse(self, response):
    # Get the list of URLs, for example:
    urls = ["http://a.com", "http://b.com", "http://c.com"]

    for link in urls:
        request = scrapy.Request(link)
        yield request
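
Applied to the full spider from the question, it could look like this (a minimal sketch; the XPath and the parse_link callback are assumptions to adapt to your actual pages):

import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com", "a.com", "b.com", "c.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Extract the list of URLs from the first response
        # (placeholder XPath; use whatever selector fits the page).
        urls = response.xpath("//a/@href").extract()
        for link in urls:
            # Each yielded Request is queued and scheduled by Scrapy.
            yield scrapy.Request(link, callback=self.parse_link)

    def parse_link(self, response):
        # Scrape each of the listed pages here.
        self.logger.info("Visited %s", response.url)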

Upvotes: 6

Alexandre

Reputation: 1683

For this purpose, you can subclass scrapy.Spider and define the list of URLs to start with in start_urls. Scrapy requests each of those URLs and passes every response to the parse method. Note that a plain scrapy.Spider does not follow links automatically; if you want that behavior, see the CrawlSpider sketch after the example below.

Just do something like this:

import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass

You can find more information in the official Scrapy documentation.
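
If you do want Scrapy to follow links automatically, that is what CrawlSpider is for (a minimal sketch; the spider name, domain, and callback name are placeholders):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class YourCrawlSpider(CrawlSpider):
    name = "your_crawl_spider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com/"]

    # Follow every link found on each crawled page and
    # pass the response to parse_item.
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # do whatever you want with each followed page
        self.logger.info("Crawled %s", response.url)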

Upvotes: 0
