supertreta

Reputation: 421

Scrapy with TOR (Windows)

I created a Scrapy project with several spiders to crawl some websites. Now I want to use TOR to:

  1. Hide my IP from the crawled servers;
  2. Associate my requests with different IPs, simulating access from different users.

I have read some info about this, for example: "using tor with scrapy framework" and "How to connect to https site with Scrapy via Polipo over TOR?".

The answers from these links weren't helpful to me. What are the steps that I should take to make Scrapy work properly with TOR?

EDIT 1:

Considering answer 1, I started by installing TOR. As I am using Windows, I downloaded the TOR Expert Bundle (https://www.torproject.org/dist/torbrowser/5.0.1/tor-win32-0.2.6.10.zip) and read the chapter about how to configure TOR as a relay (https://www.torproject.org/docs/tor-doc-windows.html.en). Unfortunately there is little or no information about how to do it on Windows. If I unzip the downloaded archive and run the file Tor\Tor.exe, nothing visible happens. However, I can see in the Task Manager that a new process is started. I don't know the best way to proceed from here.

Upvotes: 14

Views: 8496

Answers (2)

supertreta

Reputation: 421

After a lot of research, I found a way to set up my Scrapy project to work with TOR on Windows:

  1. Download the TOR Expert Bundle for Windows (1) and unzip the files to a folder (e.g. \tor-win32-0.2.6.10).
  2. Recent TOR versions for Windows don't come with a graphical user interface (2). It is probably possible to set up TOR using only config files and cmd commands, but for me the best option was to use Vidalia. Download it (3) and unzip the files to a folder (e.g. vidalia-standalone-0.2.21-win32). Run "Start Vidalia.exe" and go to Settings. On the "General" tab, point Vidalia to TOR (\tor-win32-0.2.6.10\Tor\tor.exe).

  3. On the "Advanced" tab, check the torrc file listed in the "Tor Configuration File" section. I have the following ports configured:

    ControlPort 9151
    SocksPort 9050

  4. Click Start Tor on the Vidalia Control Panel UI. After some processing you should see the status message "Connected to the Tor network!".

  5. Download the Polipo proxy (4) and unzip the files to a folder (e.g. polipo-1.1.0-win32). Read about this proxy at link 5.

  6. Edit the file config.sample and add the following lines to it (at the beginning of the file, for example):

    socksParentProxy = "localhost:9050"
    socksProxyType = socks5
    diskCacheRoot = ""

  7. Start Polipo from cmd. Go to the folder where you unzipped the files and run the command "polipo.exe -c config.sample".

  8. Now you have Polipo and TOR up and running. Polipo accepts HTTP requests on port 8123 and forwards them to TOR on port 9050 using the SOCKS protocol.

  9. Now you can follow the rest of the tutorial "Torifying Scrapy Project On Ubuntu" (6). Resume at the step where the tutorial explains how to test the TOR/Polipo communication.
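
Before moving on to the tutorial, it can help to confirm that both services are actually listening. The following is only a minimal sketch, not part of the original setup: it assumes TOR and Polipo run on the local machine with the ports from the steps above (9050 and 8123), and the helper name `port_is_open` is made up for illustration. It checks that the ports accept TCP connections; it does not prove that traffic is really routed through TOR.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Ports taken from the steps above: TOR SocksPort 9050, Polipo HTTP port 8123.
    for name, port in (("TOR SocksPort", 9050), ("Polipo HTTP proxy", 8123)):
        state = "listening" if port_is_open("127.0.0.1", port) else "NOT reachable"
        print(f"{name} (127.0.0.1:{port}): {state}")
```

If either port is reported as not reachable, recheck the torrc ports in Vidalia or the Polipo config before debugging anything on the Scrapy side.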

Links:

  1. https://www.torproject.org/download/download.html.en
  2. https://tor.stackexchange.com/questions/6496/tor-expert-bundle-on-windows-no-installation-instructions
  3. https://people.torproject.org/~erinn/vidalia-standalone-bundles/
  4. http://www.pps.univ-paris-diderot.fr/~jch/software/files/polipo/
  5. http://www.pps.univ-paris-diderot.fr/~jch/software/polipo/tor.html
  6. http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu

Upvotes: 15

user4125604


A detailed step-by-step explanation is here: http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu/

The basic steps there are:

  1. Install Tor and Polipo (on Linux this might require adding a repository).
  2. Configure Polipo to talk to TOR over a SOCKS connection (see the link above).
  3. Create a custom middleware that uses Tor (via Polipo) as an HTTP proxy and randomly changes the Scrapy user agent.
  4. To suppress the deprecation warning from the example above, write 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, instead of 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
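
Steps 3 and 4 could look roughly like the sketch below. This is not the code from the linked tutorial, just an illustration under some assumptions: Polipo listens on 127.0.0.1:8123, the project package is called `myproject` (hypothetical), the class names and the priority numbers 400/410 are arbitrary choices, and the user-agent strings are placeholders. It relies on Scrapy's built-in HttpProxyMiddleware, which picks up the `proxy` key from `request.meta`.

```python
import random

# Assumption: Polipo listens on its HTTP port 8123 on this machine.
POLIPO_PROXY = "http://127.0.0.1:8123"

# Placeholder values; replace with the user-agent strings you want to rotate.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware: pick a random User-Agent for every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue processing the request.


class TorProxyMiddleware:
    """Downloader middleware: send every request through Polipo (and so TOR)."""

    def process_request(self, request, spider):
        request.meta["proxy"] = POLIPO_PROXY


# Fragment for settings.py; "myproject.middlewares" is a hypothetical module path.
DOWNLOADER_MIDDLEWARES = {
    # Disable the default user-agent middleware (new, non-deprecated path).
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "myproject.middlewares.TorProxyMiddleware": 410,
}
```

The two classes would go into the project's middlewares module and the dictionary into settings.py. Keeping the proxy and the user-agent logic in separate classes makes it easy to disable one without touching the other.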

What is your scenario? Have you thought about renting proxy servers?

Upvotes: 4
