Reputation: 31
I am trying to create a spider with the package "Scrapy" that gets a list of URLs and crawls them. I have searched Stack Overflow for an answer but could not find anything that solves the issue.
My script is as follows:
import scrapy
from scrapy import Request

class Try(scrapy.Spider):
    name = "Try"

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")
        print(self.start_urls)

    def start_requests(self):
        print(self.start_urls)
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        d = response.xpath("//body").extract()
When I crawl the spider:
from scrapy.crawler import CrawlerProcess

Spider = Try(urls=[r"https://www.example.com"])
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Spider)
process.start()
I get the following printed for self.start_urls:

None
Why do I get None? Is there another way to approach this, or is there a mistake in my spider class?
Thanks for any help!
Upvotes: 0
Views: 1386
Reputation: 267
I would suggest using the spider class in process.crawl and passing the urls parameter there.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request

class Try(scrapy.Spider):
    name = 'Try'

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, self.parse)

    def parse(self, response):
        d = response.xpath("//body").extract()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Try, urls=[r'https://www.example.com'])
process.start()
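As an optional hardening step (a hedged sketch, not part of the code above): giving start_urls a default lets the spider also start when no urls keyword is supplied, and it accepts a comma-separated string such as the one scrapy crawl Try -a urls=... would pass if the spider is ever run inside a regular Scrapy project, since -a delivers spider arguments as strings.

import scrapy

class Try(scrapy.Spider):
    name = 'Try'

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        # Fall back to an empty list so iteration never fails, and split a
        # comma-separated string in case the urls come from `-a urls=...`.
        urls = kwargs.get('urls') or []
        self.start_urls = urls.split(',') if isinstance(urls, str) else urls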
Upvotes: 2
Reputation: 143097
If I run
process.crawl(Try, urls=[r"https://www.example.com"])
then it sends urls to Try as I expect. And I don't even need start_requests.
import scrapy
from scrapy.crawler import CrawlerProcess

class Try(scrapy.Spider):
    name = "Try"

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")

    def parse(self, response):
        print('>>> url:', response.url)
        d = response.xpath("//body").extract()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(Try, urls=[r"https://www.example.com"])
process.start()
But if I use
spider = Try(urls=["https://www.example.com"])
process.crawl(spider)
then it looks like it runs a new Try without urls, and then the list is empty.
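A rough illustration of why that happens (a simplified sketch of the observed behaviour, not Scrapy's actual internals): process.crawl() works from the spider class and builds its own instance with only the arguments passed to crawl() itself, so the urls given to the instance created by hand never reach the spider that actually runs.

import scrapy

class Try(scrapy.Spider):
    name = "Try"

    def __init__(self, *args, **kwargs):
        super(Try, self).__init__(*args, **kwargs)
        self.start_urls = kwargs.get("urls")

# The instance built by hand does get the urls...
spider = Try(urls=["https://www.example.com"])
print(spider.start_urls)   # ['https://www.example.com']

# ...but crawl() effectively creates a fresh instance from the class, roughly:
fresh = type(spider)()     # no urls kwarg reaches this one
print(fresh.start_urls)    # None -> which is why start_requests sees None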
Upvotes: 0