bdhar

Reputation: 22983

Scrapy + Tor + Mongodb

I'm running into a problem using Scrapy and MongoDB together with Tor. I get the following error when I enable a MongoDB pipeline in Scrapy:

2012-11-05 13:41:14-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
|S-chain|-<>-127.0.0.1:9050-<><>-127.0.0.1:27017-<--denied
Traceback (most recent call last):
  File "/usr/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 42, in run
    q = self.crawler.queue
  File "/usr/lib/python2.7/dist-packages/scrapy/command.py", line 33, in crawler
    self._crawler.configure()
  File "/usr/lib/python2.7/dist-packages/scrapy/crawler.py", line 43, in configure
    self.engine = ExecutionEngine(self.settings, self._spider_closed)
  File "/usr/lib/python2.7/dist-packages/scrapy/core/engine.py", line 33, in __init__
    self.scraper = Scraper(self, self.settings)
  File "/usr/lib/python2.7/dist-packages/scrapy/core/scraper.py", line 66, in __init__
    self.itemproc = itemproc_cls.from_settings(settings)
  File "/usr/lib/python2.7/dist-packages/scrapy/middleware.py", line 33, in from_settings
    mw = mwcls()
  File "/home/bharani/ABCD_scraper/political_forum_scraper/pipelines.py", line 9, in __init__
    settings['MONGODB_PORT'])
  File "/usr/local/lib/python2.7/dist-packages/pymongo/connection.py", line 290, in __init__
    self.__find_node()
  File "/usr/local/lib/python2.7/dist-packages/pymongo/connection.py", line 586, in __find_node
    raise AutoReconnect(', '.join(errors))
pymongo.errors.AutoReconnect: could not connect to localhost:27017: [Errno 111] Connection refused

I am not sure how to resolve this. When I run the crawler without proxychains, it works perfectly.

Any help is appreciated.

Thanks.


Edit:

It's not specific to my code. See this tutorial: http://isbullsh.it/2012/04/Web-crawling-with-scrapy/

It's a simple tutorial on using Scrapy with MongoDB. Running

scrapy crawl isbullshit

starts the crawler, and that works perfectly fine. To route it through Tor, it should be run as

proxychains scrapy crawl isbullshit

which is what fails for me. The tutorial's source code is here: https://github.com/BaltoRouberol/isbullshit-crawler

Upvotes: 0

Views: 2288

Answers (3)

user2889561

Reputation: 1

Open the mongo connection before setting the SOCKS proxy.

Upvotes: 0

Francois Dang Ngoc

Reputation: 86

It might be that proxychains is redirecting your MongoDB connection (localhost:27017) through Tor as well. If you want to exclude localhost connections from proxychains, you can add the following line to your /etc/proxychains.conf:

localnet 127.0.0.1 000 255.255.255.255
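For reference, newer proxychains/proxychains-ng versions write the same exclusion in subnet/netmask form. A minimal /etc/proxychains.conf sketch (the localnet line is the relevant part; the SOCKS entry assumes Tor's default port 9050, as shown in the question's log):

```
# /etc/proxychains.conf (proxychains-ng) - sketch
strict_chain
proxy_dns

# Do not proxy loopback traffic, so pymongo can reach localhost:27017 directly
localnet 127.0.0.0/255.0.0.0

[ProxyList]
socks5 127.0.0.1 9050
```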

Upvotes: 1

user689383

Reputation:

pymongo.errors.AutoReconnect: could not connect to localhost:27017: [Errno 111] Connection refused

It seems you cannot connect to localhost on port 27017. Are the host and port correct? Also make sure the mongodb server is actually running in the background; otherwise you will never be able to connect.
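As a quick check that something is actually listening on that port, here is a minimal sketch using only the Python standard library (the host and port defaults mirror pymongo's):

```python
import socket

def is_listening(host="127.0.0.1", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timed out, unreachable, ...
        return False

if __name__ == "__main__":
    if is_listening():
        print("mongod is reachable on 127.0.0.1:27017")
    else:
        print("nothing listening on 127.0.0.1:27017 - is mongod running?")
```

If this prints that nothing is listening even without proxychains, the problem is the server itself, not Tor.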

If mongodb refuses to start, a stale lock file left by an unclean shutdown may be the cause. Remove only the lock file (not the whole data directory):

sudo rm /var/lib/mongodb/mongod.lock

and restart the server, something like

sudo service mongodb start

on Debian, or

sudo systemctl restart mongodb

on Arch Linux.

Upvotes: 2
