Josh Korsik

Reputation: 47

(Python, Scrapy) Taking data from txt file into Scrapy spider

I am new to Python and Scrapy. I have a project whose spider contains code like this:

class MySpider(BaseSpider):
    name = "project"
    allowed_domains = ["domain.com"]
    start_urls = ["https://domain.com/%d" % i for i in range(12308128,12308148)]

I want to take the range boundaries, 12308128 and 12308148, from a txt file (or a csv file).

Let's say it's numbers.txt, containing two lines:

12308128
12308148

How can I read these numbers into my spider? Another process will change the numbers in the txt file periodically, and the spider should pick up the new values each time it runs.

Thank you.

Upvotes: 1

Views: 1839

Answers (3)

mizhgun

Reputation: 1887

You can pass any parameter to the spider's constructor from the command line using the -a option of the scrapy crawl command, for example:

scrapy crawl spider -a inputfile=filename.txt

Then use it in the spider like this:

import scrapy
from scrapy.exceptions import CloseSpider

class MySpider(scrapy.Spider):
    name = 'spider'

    def __init__(self, *args, **kwargs):
        # pop the custom argument before passing the rest to the base class
        self.infile = kwargs.pop('inputfile', None)
        super(MySpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        if self.infile is None:
            raise CloseSpider('No input file given')
        # process the file; its name is in self.infile
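For the file-processing step itself, a minimal sketch of start_requests() (assuming the two-line numbers.txt layout and the https://domain.com/%d URL pattern from the question, and keeping the same exclusive upper bound as the original start_urls):

    def start_requests(self):
        if self.infile is None:
            raise CloseSpider('No input file given')
        # the file holds the start number on line 1 and the end number on line 2
        with open(self.infile) as f:
            numbers = f.read().split()
        start, end = int(numbers[0]), int(numbers[1])
        for i in range(start, end):
            yield scrapy.Request("https://domain.com/%d" % i)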

Or you can pass the start and end values directly in a similar way:

scrapy crawl spider -a start=10000 -a end=20000
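A sketch of that variant (again assuming the URL pattern from the question; note that -a arguments arrive as strings and must be converted):

import scrapy

class MySpider(scrapy.Spider):
    name = 'spider'

    def __init__(self, *args, **kwargs):
        # command-line arguments arrive as strings, so convert them to ints
        self.start = int(kwargs.pop('start', 0))
        self.end = int(kwargs.pop('end', 0))
        super(MySpider, self).__init__(*args, **kwargs)

    def start_requests(self):
        for i in range(self.start, self.end):
            yield scrapy.Request("https://domain.com/%d" % i)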

Upvotes: 1

Granitosaurus

Reputation: 21436

You can override the start_urls logic in the spider's start_requests() method:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # read the two boundary numbers from the file
        with open('numbers.txt', 'r') as f:
            start, end = f.read().split('\n', 1)
        # build the range and the urls from your numbers
        range_ = (int(start.strip()), int(end.strip()))
        start_urls = ["https://domain.com/%d" % i for i in range(*range_)]
        for url in start_urls:
            yield scrapy.Request(url)

This spider opens the file, reads the numbers, builds the start URLs, iterates over them, and schedules a request for each one. Since the file is read every time the spider starts, it will pick up whatever numbers your other process has written.

The default start_requests() method looks something like this:

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url)

So you can see exactly what we are replacing by overriding it.

Upvotes: 1

Shijo

Reputation: 9711

I believe you need to read the file and pass the values into your URL string:

with open('numbers.txt') as datacont:
    start_range = int(datacont.readline())
    end_range = int(datacont.readline())
print(start_range)
print(end_range)
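Those values can then be plugged into the URL pattern from the question, for example:

start_urls = ["https://domain.com/%d" % i for i in range(start_range, end_range)]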

Upvotes: 0
