Reputation: 1322
I have several Python scripts that follow a similar format: you pass in a date, and the script either:

- checks my S3 bucket for the file with that date in the filename and parses it, or
- runs a Python analysis script on the file for that date.

The important thing is that I need a timeout of at least 1 hour.
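Roughly, each script looks something like this (a simplified sketch rather than my exact code; the bucket name, key prefix, and the parse/analysis steps are placeholders):

```python
import sys
import boto3

BUCKET = "my-data-bucket"  # placeholder bucket name

def parse_file(body):
    # placeholder for the real parsing logic
    print(f"parsed {len(body)} bytes")

def run_analysis(date_str):
    # placeholder for the real analysis; this is the part that can take ~1 hour
    print(f"running analysis for {date_str}")

def process_date(date_str):
    """Look for a file whose name contains the date; parse it if found,
    otherwise run the long-running analysis for that date."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"data/{date_str}")
    if resp.get("KeyCount", 0) > 0:
        key = resp["Contents"][0]["Key"]
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        parse_file(body)
    else:
        run_analysis(date_str)

if __name__ == "__main__":
    process_date(sys.argv[1])  # e.g. python script.py 2020-01-15
```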
I am looking for a serverless solution that would let me call these functions on a range of dates and run them all in parallel. Because of the long duration of my Python scripts, services like AWS Lambda and Google Cloud Functions don't work because of their timeouts (15 minutes and 9 minutes respectively). I have looked at Google Cloud Dataflow, but I am not sure whether it is overkill for my relatively simple use case.
Minimal downtime is important, so I am leaning towards a managed offering from AWS, Google Cloud, etc.
I would also like a dashboard showing the progress of each job, with logs, so I can see which dates have completed and which dates hit a bug (and what the bug was).
Upvotes: 0
Views: 384
Reputation: 1754
AWS Fargate may be a good and simple choice for running the hour-long tasks. It supports scheduled and event-based tasks as well, so you could, for instance, process a file as soon as it is uploaded to S3, or run the job on a daily basis using a cron expression.
There is more documentation on scheduled tasks in the ECS docs.
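As a rough sketch of the one-task-per-date idea (not tested against your setup; the cluster, task definition, subnet, and container names below are placeholders you would replace with your own):

```python
import boto3
from datetime import date, timedelta

ecs = boto3.client("ecs")

def run_task_for_date(date_str):
    """Launch one Fargate task that processes a single date.
    The container reads the DATE environment variable and runs the script."""
    return ecs.run_task(
        cluster="my-cluster",                    # placeholder cluster name
        launchType="FARGATE",
        taskDefinition="date-processor:1",       # placeholder task definition
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # placeholder subnet
                "assignPublicIp": "ENABLED",
            }
        },
        overrides={
            "containerOverrides": [
                {"name": "date-processor",       # placeholder container name
                 "environment": [{"name": "DATE", "value": date_str}]}
            ]
        },
    )

# Kick off one task per date in the range; Fargate runs them in parallel.
start = date(2020, 1, 1)
for i in range(31):
    run_task_for_date((start + timedelta(days=i)).isoformat())
```

Each task's output ends up in CloudWatch Logs (assuming the task definition uses the awslogs log driver), and the ECS console shows which tasks stopped successfully and which failed, which covers the per-date progress view you asked about.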
If you want more complex handling of batch processing you can use AWS Batch, though in my experience this approach requires more orchestration effort (and gives you more flexibility).
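For comparison, submitting one Batch job per date might look roughly like this (the job queue and job definition names are placeholders):

```python
import boto3
from datetime import date, timedelta

batch = boto3.client("batch")

def submit_date_job(date_str):
    """Submit one Batch job per date; the job's container picks the date
    up from the DATE environment variable."""
    return batch.submit_job(
        jobName=f"process-{date_str}",
        jobQueue="my-job-queue",            # placeholder job queue
        jobDefinition="date-processor:1",   # placeholder job definition
        containerOverrides={
            "environment": [{"name": "DATE", "value": date_str}]
        },
    )

start = date(2020, 1, 1)
for i in range(31):
    submit_date_job((start + timedelta(days=i)).isoformat())
```

The Batch console then lists each job's status (SUCCEEDED/FAILED) with a link to its logs, which maps fairly well onto the per-date dashboard requirement.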
Serverless.com has a great blog post on how to use Fargate to run long-running tasks.
Upvotes: 0
Reputation: 15276
Services such as AWS Lambda or GCP Cloud Functions are, by definition, short-running. If a computational task might run for a long time (and I consider an hour a long time), then the function-as-a-service story isn't a good match. Let us now look at what you actually want: a long-running job per date, run in parallel across a range of dates, with visibility into which dates succeeded and which failed.
One possible solution is to use GCP Compute Engine and the notion of a "managed instance group". Using this technology you define a Compute Engine template that will spin up Linux or Windows VM instances with as many (or as few) CPUs and as much RAM as needed. The number of instances is a function of how you define load ... including dropping to zero. When you define your Compute Engine template, you have 100% control over it, including defining the initial startup applications through a startup script. I could imagine you writing a startup script that runs your application.
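A minimal sketch of the startup-script idea, using a single instance insert per date rather than a full managed instance group (the project, zone, image, bucket, and script path are all placeholders):

```python
import googleapiclient.discovery

compute = googleapiclient.discovery.build("compute", "v1")

PROJECT = "my-project"   # placeholder project id
ZONE = "us-central1-a"   # placeholder zone

def create_worker(date_str):
    """Create one VM whose startup script runs the analysis for a single date
    and then deletes the instance when it finishes."""
    name = f"worker-{date_str.replace('-', '')}"
    startup_script = f"""#!/bin/bash
# Fetch and run the per-date script (placeholder bucket and script name).
gsutil cp gs://my-bucket/process.py /tmp/process.py
python3 /tmp/process.py {date_str}
# Self-delete when done; requires the VM's service account to have permission.
gcloud compute instances delete {name} --zone={ZONE} --quiet
"""
    config = {
        "name": name,
        "machineType": f"zones/{ZONE}/machineTypes/e2-standard-2",
        "disks": [{
            "boot": True,
            "autoDelete": True,
            "initializeParams": {
                "sourceImage": "projects/debian-cloud/global/images/family/debian-11"
            },
        }],
        "networkInterfaces": [{
            "network": "global/networks/default",
            "accessConfigs": [{"type": "ONE_TO_ONE_NAT", "name": "External NAT"}],
        }],
        "metadata": {"items": [{"key": "startup-script", "value": startup_script}]},
    }
    return compute.instances().insert(project=PROJECT, zone=ZONE, body=config).execute()
```

Because the VM deletes itself at the end of the script, you only pay for the time each date actually takes to process; the per-instance serial console and Cloud Logging output give you a place to see which dates finished and which errored.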
While this is indeed more work than the mantra of "you bring the code and we bring everything else", it is the state of play. Another potential solution is (as you were alluding to) to examine the nature of your processing and sub-divide it into finer-grained, smaller work units.
If Compute Engine feels like too much, an alternative would be to embrace Kubernetes and run a cluster with pods containing your application.
Upvotes: 2