Reputation: 5277
I have a certain script (python), which needs to be automated that is relatively memory and CPU intensive. For a monthly process, it runs ~300 times, and each time it takes somewhere from 10-24 hours to complete, based on input. It takes certain (csv) file(s) as input and produces certain file(s) as output, after processing of course. And btw, each run is independent.
We need to use configs and be able to pass command line arguments to the script. Certain imports, which are not default python packages, need to be installed as well (requirements.txt). Also, need to take care of logging pipeline (EFK) setup (as ES-K can be centralised, but where to keep log files and fluentd config?)
Last bit is monitoring - will we be able to restart in case of unexpected closure? Best way to automate this, tools and technologies?
Create a docker image of the whole setup (python script, fluent-d config, python packages etc.). Now we somehow auto deploy this image (on a VM (or something else?)), execute the python process, save the output (files) to some central location (datalake, eg) and destroy the instance upon successful completion of process.
So, is what I'm thinking possible in Azure? If it is, what are the cloud components I need to explore -- answer to my somehows and somethings? If not, what is probably the best solution for my use case?
Any lead would be much appreciated. Thanks.
Upvotes: 1
Views: 725
Reputation: 2848
You probably suppose to use Azure Data Factory for moving and transforming data. Then you can also use ADF for calling Azure Batch that will be using python. https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
Adding more info could probably suggest other better suggestions.
Upvotes: 0
Reputation: 29840
Normally for short living jobs I'd say use an Azure Function. Thing is, they have a maximum runtime of 10 minutes unless you put them on an App Service Plan. But that will costs more unless you manually stop/start the app service plan.
If you can containerize the whole thing I recommend using Azure Container Instances because you then only pay for what you actual use. You can use an Azure Function to start the container, based on an http request, timer or something like that.
You can set a restart policy to indicate what should happen in case of unexpected failures, see the docs.
Configuration can be passed from the Azure Function to the container instance or you could leverage the Azure App Configuration service.
Upvotes: 1
Reputation: 3592
I would go with Azure Devops and a custom agent pool. This agent pool could include some virtual machines (maybe only one) with docker installed. I would then install all the necessary packages that you mentioned on this docker container and also the DevOps agent (it will be needed to communicate with the agent pool).
You could pass every parameter needed in the build container agents through Azure Devops tasks and also have a common storage layer for build and release pipeline. This way you could mamipulate/process your files on the build pipeline and then using the same folder create a task on the release pipeline to export/upload those files somewhere.
As this script should run many times through the month, you could have many containers so that to run more than one job at a given time.
I follow the same procedure for a corporate environment. I keep a VM running windows with multiple docker machines to compile diferent code frameworks. Each container includes different tools and is registered to a custom agent pool. Jobs are distributed across those containers and build and release pipelines integrate with multiple processing.
Upvotes: 0
Reputation: 3814
Though I don't know all the details, this sounds like a good candidate for Azure Batch. There is no additional charge for using Batch. You only pay for the underlying resources consumed, such as the virtual machines, storage, and networking. Batch works well with intrinsically parallel (also known as "embarrassingly parallel") workloads.
The following high-level workflow is typical of nearly all applications and services that use the Batch service for processing parallel workloads:
(source)
Upvotes: 0