Reputation: 1920
I have a NodeJS script that scrapes URLs every day. The requests are throttled to be kind to the server, which means the script runs for a fairly long time (several hours).
I have been looking for a way to deploy it on GCP. Because it was previously run from cron, I naturally looked at how to run a cron job on Google Cloud. However, according to the docs, the script has to be exposed as an HTTP endpoint, and requests to that endpoint can only run for up to 60 minutes, which doesn't fit my needs.
I had a look at this S.O. question, which recommends using a Cloud Function. However, I am unsure this approach would suit my case, as my script requires a lot more processing than the simple server-monitoring job described there.
Does anyone have experience doing this on GCP?
N.B.: To clarify, I want to avoid deploying it on a VPS.
Edit: I reached out to Google; here is their reply:
Thank you for your patience. Currently, it is not possible to run a cron script for 6 to 7 hours in a row, since the current limitation for cron in App Engine is 60 minutes per HTTP request. If it is possible for your use case, you can spread the 7 hours across recurring tasks, for example, every 10 minutes or 1 hour. A cron job request is subject to the same limits as those for push task queues. Free applications can have up to 20 scheduled tasks. You may refer to the documentation for the cron schedule format.
Also, it is possible to still use Postgres and Redis with this. However, kindly take note that Postgres is still in beta.
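For reference, the kind of spreading they suggest would look roughly like this in App Engine's cron.yaml (the /scrape-batch handler name is a placeholder; each invocation would have to process only one slice of the URLs and finish well within the 60-minute limit):

```yaml
cron:
- description: scrape one batch of URLs
  url: /scrape-batch        # hypothetical handler that processes one slice of work
  schedule: every 10 minutes
```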
As I can't spread the task, I had to keep on managing a dokku VPS for this.
Upvotes: 6
Views: 4200
Reputation: 1714
I'm also working on a task like this. I need to crawl websites and have the same problem.
Instead of running the main crawler task on a VM, I moved the task to Google Cloud Functions. The task consists of getting the target URL, scraping the page, saving the result to Datastore, and then returning the result to the caller.
This is how it works: I have a long-running application that can be called the master. The master knows which URLs we are going to access, but instead of accessing the target website itself, it sends each URL to a crawler function in GCF. The crawling task is done there, and the result is sent back to the master. This way, the master only sends a request and gets back a small amount of data, and never touches the target website; GCF does the rest. You can offload your master and crawl the websites in parallel via GCF, or you can use other methods to trigger GCF instead of HTTP requests. A minimal sketch of such a crawler function is below.
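This is only a sketch, assuming an HTTP-triggered function on a Node 18+ runtime (so global fetch is available) and the @google-cloud/datastore client; the function name and the Page kind are placeholders:

```js
// index.js - HTTP-triggered Cloud Function (Node.js 18+ runtime assumed)
const { Datastore } = require('@google-cloud/datastore');

const datastore = new Datastore();

// The master POSTs { "url": "https://example.com/page" } to this function.
exports.crawlPage = async (req, res) => {
  const { url } = req.body;
  if (!url) {
    return res.status(400).send('Missing "url" in request body');
  }

  // Fetch the target page (global fetch is available on Node 18+).
  const response = await fetch(url);
  const html = await response.text();

  // Persist the raw result under a hypothetical "Page" kind in Datastore.
  const key = datastore.key(['Page']);
  await datastore.save({
    key,
    data: { url, html, fetchedAt: new Date() },
  });

  // Return only a small summary to the master, not the full page.
  res.status(200).json({ url, bytes: html.length, datastoreId: key.id });
};
```

The master then only receives the small JSON summary, while the heavy lifting and the page content stay inside GCF and Datastore (note that Datastore entities are capped at roughly 1 MB, so very large pages would need Cloud Storage instead).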
Upvotes: 0
Reputation: 916
I would suggest combining two services, GAE Cron Jobs and Cloud Tasks.
Use a GAE cron job to publish the list of sites and ranges to scrape to Cloud Tasks. This initialization step doesn't need to be 'kind' to the server yet; it can simply publish all chunks of work to the Cloud Tasks queue and consider itself finished once everything is enqueued (a rough sketch of that step follows below).
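A rough sketch of that publishing step with the Node.js Cloud Tasks client; the project, location, queue name and the /scrape-chunk worker path are all placeholders:

```js
// Runs inside the GAE cron handler: fan out chunks of work to Cloud Tasks.
const { CloudTasksClient } = require('@google-cloud/tasks');

const client = new CloudTasksClient();

async function enqueueChunks(chunks) {
  // Project, location and queue name are placeholders.
  const parent = client.queuePath('my-project', 'us-central1', 'scrape-queue');

  for (const chunk of chunks) {
    await client.createTask({
      parent,
      task: {
        appEngineHttpRequest: {
          httpMethod: 'POST',
          relativeUri: '/scrape-chunk', // hypothetical worker endpoint
          body: Buffer.from(JSON.stringify(chunk)).toString('base64'),
          headers: { 'Content-Type': 'application/json' },
        },
      },
    });
  }
}
```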
Follow that up with a task queue, and use the queue's rate-limiting configuration as the way to cap the overall request rate to the endpoint you're scraping (see the queue sketch after the quote below). If you need less than 1 qps, add a sleep statement directly in your code. If you're really queueing millions or billions of jobs, follow their advice of having one queue fan out into another:
Large-scale/batch task enqueues: When a large number of tasks, for example millions or billions, need to be added, a double-injection pattern can be useful. Instead of creating tasks from a single job, use an injector queue. Each task added to the injector queue fans out and adds 100 tasks to the desired queue or queue group. The injector queue can be sped up over time, for example start at 5 TPS, then increase by 50% every 5 minutes.
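Coming back to the rate limiting mentioned above, a minimal sketch of the queue configuration (queue name and numbers are purely illustrative), e.g. in queue.yaml:

```yaml
queue:
- name: scrape-queue
  rate: 1/s                   # at most one dispatch per second to the scraper
  max_concurrent_requests: 1  # never hit the target site in parallel
```

Each task handler then scrapes only its own chunk and returns within the normal request deadline, so the 60-minute limit stops being an issue.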
That should be pretty hands-off, and only requires you to think through how the cron job picks the next sites and pages to scrape, and how small the chunks of work should be.
Upvotes: 0