Reputation: 14970
I am using wget to download a huge list of web pages (around 70,000). I am forced to put a sleep of around 2 seconds between successive wget calls, so the whole run takes a huge amount of time, something like 70 days. What I would like to do is use proxies so that I can significantly speed up the process. I am using a simple bash script for this. Any suggestions and comments are appreciated.
Upvotes: 3
Views: 2877
Reputation: 49085
My first suggestion is to not use Bash or wget; wget is not really designed for screen scraping. I would use Python and Beautiful Soup instead.
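For example, a minimal sketch of that approach might look like the following (it assumes the third-party requests and beautifulsoup4 packages and a urls.txt file with one URL per line; those names are placeholders, not details from the question):

    # Minimal sketch: fetch each URL and parse it with Beautiful Soup.
    # Assumes `pip install requests beautifulsoup4`; urls.txt is illustrative.
    import time

    import requests
    from bs4 import BeautifulSoup

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException as exc:
            print(f"failed: {url}: {exc}")
            continue

        soup = BeautifulSoup(response.text, "html.parser")
        # Replace this with whatever data you actually need from each page;
        # printing the <title> is only a placeholder.
        title = soup.title.string if soup.title else ""
        print(url, title)

        time.sleep(2)  # keeps the polite delay from the original script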
Second, look into spreading the load across multiple machines by running a portion of your list on each machine.
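To split the list, you can shard the URL file before copying one chunk to each machine; a rough sketch (the file names and machine count are made-up examples):

    # Rough sketch: shard urls.txt into one chunk per machine.
    # NUM_MACHINES and the file names are illustrative assumptions.
    NUM_MACHINES = 4

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for i in range(NUM_MACHINES):
        # Round-robin assignment keeps the chunks roughly the same size.
        chunk = urls[i::NUM_MACHINES]
        with open(f"urls_part_{i}.txt", "w") as out:
            out.write("\n".join(chunk) + "\n")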
Since it sounds like bandwidth is your issue, you can easily spin up some cloud images and run your script on them.
Upvotes: 3