pablowilks2

Reputation: 339

How to schedule site crawls in Google Cloud?

I want to run the web crawling software Screaming Frog in Google Cloud, either on a Compute Engine instance or perhaps in a Kubernetes container.

I can run Screaming Frog crawls locally on my computer from the Linux shell:

screamingfrogseospider --crawl https://www.example.com --headless --save-crawl --output-folder /tmp/cli

Is it possible to do something similar in Google Cloud?

Ideally, I would like to schedule some kind of cron task that runs the above shell command to crawl a website, with the resulting crawl saved to a Google Cloud Storage bucket.
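Something like this untested sketch is what I have in mind for the scheduled job itself (the bucket name and paths are placeholders, and it assumes the instance's service account can write to the bucket):

    # Untested sketch: run the crawl, then upload the output to GCS.
    # Assumes Screaming Frog is installed and the service account
    # has write access to the (placeholder) bucket.
    import subprocess
    from pathlib import Path

    from google.cloud import storage

    OUTPUT_DIR = Path("/tmp/cli")
    BUCKET_NAME = "my-crawl-results"  # placeholder

    def run_crawl(url: str) -> None:
        subprocess.run(
            [
                "screamingfrogseospider",
                "--crawl", url,
                "--headless",
                "--save-crawl",
                "--output-folder", str(OUTPUT_DIR),
            ],
            check=True,
        )

    def upload_results() -> None:
        client = storage.Client()
        bucket = client.bucket(BUCKET_NAME)
        for path in OUTPUT_DIR.iterdir():
            if path.is_file():
                bucket.blob(f"crawls/{path.name}").upload_from_filename(str(path))

    if __name__ == "__main__":
        run_crawl("https://www.example.com")
        upload_results()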

How might I do this?

Also, can I set up and schedule cron tasks in GCP using a programming language such as Python? The idea is for people in my organisation to log in to a UI (probably built in Flask) and schedule crawls themselves; Flask would then connect to Google Cloud and configure the task.
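For example, something along these lines with the google-cloud-scheduler client is what I am picturing (untested; the project, region, and target URL are placeholders):

    # Untested sketch: create a Cloud Scheduler job from Python.
    # Project, location, and target URL are placeholders.
    from google.cloud import scheduler_v1

    project = "my-project"       # placeholder
    location = "europe-west1"    # placeholder

    client = scheduler_v1.CloudSchedulerClient()
    parent = f"projects/{project}/locations/{location}"

    job = scheduler_v1.Job(
        name=f"{parent}/jobs/nightly-crawl",
        schedule="0 3 * * *",    # every night at 03:00
        time_zone="Europe/London",
        http_target=scheduler_v1.HttpTarget(
            uri="https://example.com/start-crawl",  # placeholder endpoint
            http_method=scheduler_v1.HttpMethod.POST,
        ),
    )

    client.create_job(parent=parent, job=job)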

Upvotes: 0

Views: 373

Answers (1)

Claudio

Reputation: 652

You can use GCP Cloud Scheduler. This page shows an example of how to start and stop a Compute Engine instance on a cron schedule: https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule. In the GCE startup script you can set your command for site crawling.
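As a rough, untested sketch of that pattern: Cloud Scheduler publishes to a Pub/Sub topic, and a Pub/Sub-triggered Cloud Function starts the instance, whose startup script then runs the crawl (project, zone, and instance name are placeholders):

    # Untested sketch: Pub/Sub-triggered Cloud Function that starts
    # a Compute Engine instance. The startup script on the instance
    # runs the crawl. Project, zone, and instance are placeholders.
    from google.cloud import compute_v1

    def start_crawler_instance(event, context):
        client = compute_v1.InstancesClient()
        client.start(
            project="my-project",
            zone="europe-west1-b",
            instance="crawler-vm",
        )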

Another option is Cloud Composer: you can write a DAG with whatever schedule you need and run the shell command with the Airflow BashOperator (Cloud Composer is the managed Airflow offering on GCP).
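A minimal DAG sketch (untested, and it assumes the screamingfrogseospider binary is available on the Composer workers, which may require custom setup):

    # Untested DAG sketch: runs the crawl command nightly.
    # Assumes screamingfrogseospider is installed on the workers.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="screaming_frog_crawl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 3 * * *",  # every night at 03:00
        catchup=False,
    ) as dag:
        crawl = BashOperator(
            task_id="crawl_site",
            bash_command=(
                "screamingfrogseospider --crawl https://www.example.com "
                "--headless --save-crawl --output-folder /tmp/cli"
            ),
        )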

Upvotes: 3
