Reputation: 47
I am new at Python and Scrapy. I have a project. In the spider there is a code like that:
class MySpider(BaseSpider):
name = "project"
allowed_domains = ["domain.com"]
start_urls = ["https://domain.com/%d" % i for i in range(12308128,12308148)]
I want to take the range numbers between 12308128
and 12308148
from a txt file (or csv file)
Lets say its numbers.txt including two lines in it:
12308128
12308148
How can I import these numbers to my spider? Another process will change these numbers in txt file periodically and my spider will update the numbers and run.
Thank you.
Upvotes: 1
Views: 1839
Reputation: 1887
You can pass any parameters to spider's constructor through command line using option -a
of scrapy crawl
command for ex.)
scrapy crawl spider -a inputfile=filename.txt
then use it like this:
class MySpider(scrapy.Spider):
name = 'spider'
def __init__(self, *args, **kwargs):
self.infile = kwargs.pop('inputfile', None)
def start_requests(self):
if self.infile is None:
raise CloseSpider('No filename')
# process file, name in self.infile
or you can just pass start and end values in similar way like this:
scrapy crawl spider -a start=10000 -a end=20000
Upvotes: 1
Reputation: 21436
You can override the start_urls logic in spider's start_requests()
method:
class Myspider(scrapy.Spider):
name = 'myspider'
def start_requests(self):
# read file data
with open('filename', 'r') as f:
start, end = f.read().split('\n', 1)
# make range and urls with your numbers
range_ = (int(start.strip()), int(end.strip()))
start_urls = ["https://domain.com/%d" % i for i in range(range_)]
for url in start_urls:
yield scrapy.Request(url)
This spider will open up file, read the numbers, create starting urls, iterate through them and schedule a request for each one of them.
Default start_requests()
method looks something like:
def start_requests(self):
for url in self.start_urls:
yield scrapy.Request(url)
So you can see what we're doing here by overriding it.
Upvotes: 1
Reputation: 9711
I believe you need to read the file and pass the values to your url string
Start_Range = datacont.readline()
End_Range = datacont.readline()
print Start_Range
print End_Range
Upvotes: 0