giaosudau

Reputation: 2251

How to add proxy like Tor when scraping using node.io?

I am using node.io to build a web scraper, but while figuring out how to do it I made so many requests that the site blocked me. I don't know how to add a proxy, such as Tor, to make requests to this site.

Upvotes: 1

Views: 4640

Answers (3)

buycanna.io

Reputation: 1204

apt-get install tor

npm install tor-request   # or: yarn add tor-request

https://www.npmjs.com/package/tor-request
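For reference, a minimal usage sketch based on the package's documentation (this assumes Tor is already installed and listening on its default SOCKS port, 9050, which tor-request uses out of the box):

    var tr = require('tor-request');

    // Make an HTTP request routed through the local Tor SOCKS proxy
    tr.request('https://api.ipify.org', function (err, res, body) {
        if (!err && res.statusCode == 200) {
            console.log('Your public (through Tor) IP is: ' + body);
        }
    });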

Upvotes: -1

giaosudau

Reputation: 2251

I followed this article: http://pkmishra.github.io/blog/2013/03/18/how-to-run-scrapy-with-TOR-and-multiple-browser-agents-part-1-mac/

I installed Tor and Polipo. Polipo connects to Tor, and node.io uses the HTTP proxy that Polipo provides. It's simpler than I thought. Then set the proxy on the scraper:

    var scrap = new Scraper({
        start: 0,
        limit: 5,
        count: null,
        max: config.max || 0,
        debug: true,
        wait: 3,                        // seconds to wait between requests
        proxy: 'http://127.0.0.1:8123'  // Polipo's default HTTP proxy port
    });

It works fine.
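For anyone following along, here is a minimal sketch of the Polipo configuration that chains it to Tor (assuming Tor's default SOCKS port 9050 and Polipo's default HTTP port 8123; typically placed in /etc/polipo/config):

    # Forward all outgoing traffic to Tor's local SOCKS listener
    socksParentProxy = "localhost:9050"
    socksProxyType = socks5

    # Expose an HTTP proxy on 127.0.0.1:8123 for the scraper to use
    proxyAddress = "127.0.0.1"
    proxyPort = 8123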

Upvotes: 5

halfer

Reputation: 20429

We would really need to see what sort of site this is, why you are scraping it, and ideally which specific site it is, in order to give advice. Do you know why you were blocked?

The first thought I have is that you have been crawling the site too fast, and that you have been blocked quite legitimately for this reason. If your business relies on crawling just one site (e.g. prices from eBay), then you need to do it with a delay of a few seconds between each request.
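For example, a minimal sketch of such a delay in Node (fetchPage here is a hypothetical stand-in for whatever scraping call you make, and the 3-5 second wait is illustrative):

    var urls = ['http://example.com/page1', 'http://example.com/page2'];

    function crawlNext(i) {
        if (i >= urls.length) return;
        // fetchPage is hypothetical: substitute your own scraping call
        fetchPage(urls[i], function (err, page) {
            // ... process the page ...
            // wait 3-5 seconds before requesting the next page
            setTimeout(function () {
                crawlNext(i + 1);
            }, 3000 + Math.random() * 2000);
        });
    }

    crawlNext(0);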

I tend to take the view that site operators are perfectly within their rights to block specific scrapers if they wish. However, this view can be influenced by notions of a "common good", such as reducing the effect of partial monopoly. For example, I know someone who used to scrape prices from sites in a particular industry, and then reformat and resell that data. The effect of that data was to make the whole industry more competitive and lower prices to the consumer.

Thus, one of the target sites decided to block the crawler. Was the objection to their resource being consumed with no chance of a sale, or did they simply not like the competitive effect of the robot? Difficult to say; probably both. The scraper has now been replaced by humans, who are more expensive to operate but get the data anyway.

Thus, there are potential arguments for using proxies, but in most cases I think they are a bad idea. For example, if you intend to take someone's news articles and redisplay them elsewhere without adding any value, then of course you should be blocked. Where one draws the line, though, is complicated.


Related: my answer here offers some advice on how to crawl, including general advice on avoiding proxies and having an easily blockable user agent. Perhaps that might be useful?

Upvotes: -1
