Jean Ventura

Reputation: 27

Scrapy always running with same parameters, persistent Twisted reactor?

I've been running my Scrapy project with a couple of accounts (the project scrapes a specific site that requires login credentials), but no matter what parameters I set, it always runs with the same ones (same credentials).

I'm running under virtualenv. Is there a variable or setting I'm missing?

Edit:

It seems that this problem is Twisted related.

Even when I run:

scrapy crawl -a user='user' -a password='pass' -o items.json -t json SpiderName

I still get an error saying:

ERROR: twisted.internet.error.ReactorNotRestartable

And all the information I get is from the last 'successful' run of the spider.

Upvotes: 2

Views: 1065

Answers (2)

Jean Ventura

Reputation: 27

Found the problem. My project tree was 'dirty'.

After another developer renamed the file containing the spider code and I pulled those changes into my local repo, the update deleted only the .py version of the old file and left the .pyc behind (because of .hgignore). Scrapy was therefore finding the same spider module twice (the same spider lived in two different files) and calling both under the same Twisted reactor.

After deleting the offending file everything is back to normal.
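For anyone hitting the same issue, stale bytecode can be cleared in one pass. This is just a sketch: it assumes a Unix shell and that you run it from the project root.

```shell
# delete every stale compiled .pyc file under the project tree
find . -name '*.pyc' -delete
```

Python will regenerate the .pyc files from the surviving .py sources on the next run, so this is safe to do any time the tree looks 'dirty'.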

Upvotes: 1

nickzam

Reputation: 813

Check your spider's __init__ method: if it doesn't already accept the username and password, pass them in there. Like this:

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, username=None, password=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/']
        self.username = username
        self.password = password

    def start_requests(self):
        return [FormRequest("http://www.example.com/login",
                            formdata={'user': self.username, 'pass': self.password},
                            callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

Run it:

scrapy crawl myspider -a username=yourname -a password=yourpass

Code adapted from: http://doc.scrapy.org/en/0.18/topics/spiders.html

EDIT: You can have only one Twisted reactor, but you can run multiple spiders in the same process with different credentials. Example of running multiple spiders: http://doc.scrapy.org/en/0.18/topics/practices.html#running-multiple-spiders-in-the-same-process

Upvotes: 2
