Reputation: 339
I want to run a copy of the web crawling software Screaming Frog in Google Cloud. This would be on a Compute Engine instance or perhaps in a Kubernetes container.
It is possible to run Screaming Frog crawls locally on my computer from a Linux shell:
screamingfrogseospider --crawl https://www.example.com --headless --save-crawl --output-folder /tmp/cli
Is it possible to do something similar in Google Cloud?
Ideally, I would like to schedule some kind of cron task that runs the above shell command, crawling a website and saving the resulting crawl to a bucket in Google Cloud Storage.
How might I do this?
Also, can I set up and schedule cron tasks in GCP using a programming language such as Python? The idea would be for people in my organisation to be able to log in to a UI (probably built in Flask) and schedule crawls themselves. Flask would then connect to Google Cloud and configure the task.
Upvotes: 0
Views: 373
Reputation: 652
You can use GCP Cloud Scheduler. At this link you can find an example of how to start and stop a Compute Engine instance on a cron schedule: https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule. In the GCE startup script you can set your command for site crawling (and copy the results to your bucket, e.g. with gsutil).
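Since you also asked about driving this from Python (e.g. from a Flask view), here is a minimal sketch that creates such a Scheduler job programmatically with the google-cloud-scheduler client library. It assumes you have already set up the Pub/Sub topic and Cloud Function from the linked tutorial; the project ID, region, cron schedule, zone and instance label are all placeholders.

import json

from google.cloud import scheduler_v1

project_id = "my-project"   # placeholder
location = "europe-west2"   # must be a region where Cloud Scheduler runs

client = scheduler_v1.CloudSchedulerClient()
parent = client.common_location_path(project_id, location)

# Publishes to the "start-instance-event" topic used in the linked
# tutorial; the Cloud Function subscribed to it starts matching VMs.
job = scheduler_v1.Job(
    name=f"{parent}/jobs/start-crawler-vm",
    schedule="0 2 * * *",  # every day at 02:00
    time_zone="Etc/UTC",
    pubsub_target=scheduler_v1.PubsubTarget(
        topic_name=f"projects/{project_id}/topics/start-instance-event",
        data=json.dumps({"zone": "europe-west2-b", "label": "env=crawler"}).encode(),
    ),
)

client.create_job(parent=parent, job=job)

A Flask endpoint could run the same code with user-supplied schedules, which would cover the self-service part of your question.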
Another option is Cloud Composer: you can write a DAG scheduled however you need and run the shell command with the Airflow BashOperator (Cloud Composer is the managed Airflow implementation on GCP).
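For the Composer route, a minimal DAG sketch could look like the following. The schedule and bucket name are placeholders, and it assumes the Screaming Frog CLI is available on the Composer workers, which in practice may require custom setup (or a KubernetesPodOperator with a container image that bundles the binary):

import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 1.10: airflow.operators.bash_operator

with DAG(
    dag_id="screaming_frog_crawl",
    start_date=datetime.datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",  # daily at 02:00, placeholder schedule
    catchup=False,
) as dag:
    # Run the same CLI command as in the question, then copy the saved
    # crawl to a GCS bucket ("my-crawl-bucket" is a placeholder name).
    crawl_and_upload = BashOperator(
        task_id="crawl_and_upload",
        bash_command=(
            "screamingfrogseospider --crawl https://www.example.com "
            "--headless --save-crawl --output-folder /tmp/cli "
            "&& gsutil cp /tmp/cli/* gs://my-crawl-bucket/"
        ),
    )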
Upvotes: 3