Reputation: 11
allowed_domains = ["textfiles.com/100"]
start_urls = ['http://textfiles.com/100/']
def parse(self,response):
link=response.css('a::attr(href)').extract()
for i in link:
temp="http://www.textfiles.com/100/"+i
data=scrapy.Request(temp,callback=self.parsetwo)
def parsetwo(self,response):
print(response.text)
Upvotes: 1
Views: 55
Reputation: 1870
There are two problems with your current approach. First,

allowed_domains = ["textfiles.com/100"]

makes all subsequent requests fail: allowed_domains takes bare domain names, and the actual domain here is textfiles.com, so with a path in the list the offsite filter drops every follow-up request. Second, parse builds each Request but only assigns it to a variable instead of yielding it, so Scrapy never schedules those requests and parsetwo never runs.
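As a quick illustrative check (the file name below is made up), Scrapy's url_is_from_any_domain helper applies essentially the same domain test the offsite filter uses:

from scrapy.utils.url import url_is_from_any_domain

# A list entry containing a path never matches the request's host
url_is_from_any_domain('http://textfiles.com/100/sample.txt', ['textfiles.com/100'])  # False
# A bare domain matches as expected
url_is_from_any_domain('http://textfiles.com/100/sample.txt', ['textfiles.com'])      # True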
I made those two changes and got it to work:
from scrapy import Spider
from scrapy import Request


class TextCrawler(Spider):
    name = 'Text'
    allowed_domains = ['textfiles.com']  # domain only, no path
    start_urls = ['http://textfiles.com/100/']

    def parse(self, response):
        link = response.css('a::attr(href)').extract()
        for i in link:
            temp = 'http://textfiles.com/100/' + i
            # yield so Scrapy actually schedules the follow-up request
            yield Request(temp, callback=self.parsetwo)

    def parsetwo(self, response):
        print(response.text)
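If you want to try this quickly outside a full Scrapy project, a minimal driver looks something like this (a sketch; the log level is just an example choice):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(TextCrawler)
process.start()  # blocks until the crawl finishes

Alternatively, scrapy runspider yourfile.py (any file containing the spider class) does the same from the command line.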
Upvotes: 1