Reputation: 7
I have a function that gets all the links on the first page.
How can I make another function that sends a request for each link in the list and gets all the links from the second-page response?
import scrapy

# Links collected from the start page
list = []


class SuperSpider(scrapy.Spider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    def parse(self, response):
        # Extract every href on the page and keep the unique ones
        links = response.xpath('//a/@href').extract()
        for link in links:
            link = str(link).strip()
            if link not in list:
                list.append(link)
Upvotes: 0
Views: 384
Reputation: 2110
Your use case is a perfect fit for Scrapy's CrawlSpider. Note that the allowed_domains setting is very important here, as it defines which domains may be crawled. If you remove it, the spider will follow every link it finds on every page, far beyond nytimes.com. See the sample below.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class NytimesSpider(CrawlSpider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
    }

    rules = (
        Rule(LinkExtractor(allow=r''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            "url": response.url
        }
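For reference, the same two-level crawl can also be written without CrawlSpider by yielding a request for each extracted link and handling it in a second callback, which is closer to the structure the question describes. This is a minimal sketch; the parse_second_page name is my own choice, not part of the original code, and allowed_domains still keeps the spider on nytimes.com.

import scrapy


class SuperSpider(scrapy.Spider):
    name = 'nytimes'
    allowed_domains = ['nytimes.com']
    start_urls = ['https://www.nytimes.com/']

    def parse(self, response):
        # First page: send a request for each link and handle it in a second callback
        for link in response.xpath('//a/@href').getall():
            yield response.follow(link.strip(), callback=self.parse_second_page)

    def parse_second_page(self, response):
        # Second page: collect the links found there
        for link in response.xpath('//a/@href').getall():
            yield {"source": response.url, "url": link.strip()}

Either spider can be run with scrapy crawl nytimes -o links.json to write the collected URLs to a file.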
Upvotes: 1