B.Mr.W.

Reputation: 19648

Using boto to distribute scraping tasks among AWS instances

I have 200,000 URLs that I need to scrape from a website. The website has a very strict scraping policy, and you get blocked if you make 10 or more requests per minute, so I need to control my pace. I am thinking about starting a few AWS instances (say 3) to run in parallel.

In this way, the estimated time to collect all the data will be:

200,000 URLs / (10 URLs/min) = 20,000 min ≈ 13.9 days (one instance only)
20,000 min / 3 instances ≈ 4.6 days (three instances)
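The same estimate as a small calculation, in case the numbers change. This is only a sketch: `estimate_days` is a made-up helper name, and it assumes every instance scrapes around the clock at the full rate, which ignores any night-only restriction.

```python
TOTAL_URLS = 200_000
RATE_PER_MIN = 10  # per-instance rate used in the estimate above


def estimate_days(total_urls, rate_per_min, instances=1):
    """Days of continuous scraping at the given per-instance rate."""
    minutes = total_urls / (rate_per_min * instances)
    return minutes / (60 * 24)


print(round(estimate_days(TOTAL_URLS, RATE_PER_MIN), 1))     # 13.9
print(round(estimate_days(TOTAL_URLS, RATE_PER_MIN, 3), 1))  # 4.6
```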

which is a legit amount of time to get my work done.

However, I am thinking about building a framework using boto, where I have a piece of code and a queue of input (a list of URLs in this case). I also don't want to do any damage to their website, so I only want to scrape at night and on weekends. I think all of this should be controlled from one box.

The code should look something like this:

    class Worker(job, queue):
        url = queue.pop()
        aws = new AWSInstance()
        result = aws.scrape(url)
        return result

    worker1 = new Worker()
    worker2 = new Worker()
    worker3 = new Worker()

    worker1.start()
    worker2.start()
    worker3.start()

The code above is pure pseudocode; my idea is to pass the work to AWS.

Questions:

(1) How do I use boto to pass variables/arguments to another AWS instance and start a script to work on those variables, and then use boto to retrieve the result back to the master box?

(2) What is the best way to schedule a job to run only during a specific time period from inside Python code, say only from 6:00 pm to 6:00 am every day? I don't think the Linux crontab will fit my need in this situation.

Sorry if my question is more descriptive and philosophical than concrete. Even a hint, or the name of a package/library that meets my need, would be greatly appreciated!

Upvotes: 0

Views: 277

Answers (1)

kukido

Reputation: 10601

Question: (1) How to use boto to pass the variable/argument to another AWS instance and start a script to work on those variables

Use a shared data source such as DynamoDB, or a messaging service such as SQS.
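A rough sketch of the SQS route, assuming boto3 (the current SDK) rather than the older boto package. The helper names `enqueue_urls`, `take_url` and `mark_done` are made up for illustration; `sqs` is expected to be a client from `boto3.client("sqs")` and `queue_url` the URL of a queue you have already created. The master box fills the queue once, and each worker instance pulls one URL at a time.

```python
def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def enqueue_urls(sqs, queue_url, urls):
    """Master side: push URLs onto the queue, 10 per call (the SQS batch limit)."""
    sent = 0
    for batch_no, batch in enumerate(chunks(urls, 10)):
        entries = [
            {"Id": f"{batch_no}-{i}", "MessageBody": url}
            for i, url in enumerate(batch)
        ]
        # The response can list per-entry failures under "Failed";
        # this sketch does not retry them.
        sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        sent += len(entries)
    return sent


def take_url(sqs, queue_url):
    """Worker side: return (url, receipt_handle), or None if the queue is empty."""
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    messages = resp.get("Messages", [])
    if not messages:
        return None
    return messages[0]["Body"], messages[0]["ReceiptHandle"]


def mark_done(sqs, queue_url, receipt_handle):
    """Worker side: delete the message once the URL has been scraped."""
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
```

A message that is received but never deleted becomes visible again after the queue's visibility timeout, so a URL whose worker crashed mid-scrape should eventually be picked up by another worker.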

and .. use boto to retrieve the result back to the master box.

Again, a shared data source or messaging.
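For the return path, one possible shape is a second SQS queue that workers write to and the master box drains. This is a sketch assuming boto3; `report_result` and `collect_results` are made-up names, `sqs` is a `boto3.client("sqs")` client and `results_queue_url` points at a queue you created for results. SQS message bodies are size-limited, so for full pages it is probably better to store the content in a shared store (DynamoDB, for instance) and send only a small record like the one below.

```python
import json


def report_result(sqs, results_queue_url, url, payload):
    """Worker side: send one scraped result back as a JSON message."""
    body = json.dumps({"url": url, "payload": payload})
    sqs.send_message(QueueUrl=results_queue_url, MessageBody=body)


def collect_results(sqs, results_queue_url, max_batches=1):
    """Master side: read up to max_batches batches of results and delete them."""
    results = []
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=results_queue_url,
            MaxNumberOfMessages=10,  # the most SQS returns per call
            WaitTimeSeconds=20,      # long polling, to avoid busy-looping
        )
        for msg in resp.get("Messages", []):
            results.append(json.loads(msg["Body"]))
            sqs.delete_message(
                QueueUrl=results_queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )
    return results
```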

(2) What is the best way to schedule a job only on specific time period inside Python code. Say only work on 6:00pm to 6:00 am everyday... I don't think the Linux crontab will fit my need in this situation.

I think crontab fits well here.
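A crontab entry such as `0 18 * * *` can start the worker script at 6:00 pm, and the script itself can check the clock so that it pauses outside the allowed hours. Here is a minimal sketch of that check plus a paced loop. The names `in_scrape_window` and `run_worker` are made up, `scrape_next` stands for whatever function scrapes one URL and returns False when there is nothing left, and the 7-second interval is just one choice that stays under 10 requests per minute.

```python
import datetime as dt
import time

NIGHT_START = dt.time(18, 0)  # 6:00 pm
NIGHT_END = dt.time(6, 0)     # 6:00 am


def in_scrape_window(now):
    """True all day on weekends, and from 18:00 to 06:00 on weekdays."""
    if now.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    t = now.time()
    return t >= NIGHT_START or t < NIGHT_END


def run_worker(scrape_next, clock=dt.datetime.now, sleep=time.sleep,
               interval_s=7.0, idle_s=300):
    """Call scrape_next() once per interval_s seconds while inside the window.

    Outside the window the loop sleeps idle_s seconds and checks again.
    It stops when scrape_next() returns False.
    """
    while True:
        if not in_scrape_window(clock()):
            sleep(idle_s)
            continue
        if not scrape_next():
            break
        sleep(interval_s)
```

`clock` and `sleep` are parameters only so the loop can be exercised without waiting; in normal use the defaults are enough.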

Upvotes: 1
