Reputation: 543
We have an application that, over time, stores immense amounts of data for our users (we're talking hundreds of TB or more). Due to new EU directives, should a user decide to discontinue using our services, all their data must be available for export for the next 80 days, after which it MUST be eradicated completely. The data is stored in Azure Storage block blobs, and the metadata in an SQL database.
Sadly, the data cannot be exported as-is (it is in a proprietary format), so it would need to be processed and converted to PDF for export. A file is approximately 240 KB in size, so imagine the number of PDFs involved for the data volumes stated above.
We tried using Azure Functions to split the job into small chunks of 50 items each, but at some point it went haywire, spun out of control and created enormous costs.
So what we're looking for is this: does anyone know of a service that fits these requirements?
Upvotes: 2
Views: 2568
Reputation: 84
Since a lot of people are probably looking for a solution to this kind of requirement: the new release of ADLA (Azure Data Lake Analytics) now supports the Parquet format, usable from U-SQL. With less than 100 lines of code you can read the many small files into larger ones and, using relatively few resources (vertices), compress the data into Parquet files. For example, you can store 3 TB of data in 10,000 Parquet files. Reading those files back is also very simple, and you can produce CSV files from them in no time if required. This will certainly save you a lot of cost and time.
Upvotes: 0
Reputation: 367
Here's a getting-started link for .NET, Python, or Node.js: https://learn.microsoft.com/en-us/azure/batch/batch-dotnet-get-started
The concept in batch is pretty simple, although it takes a bit of fiddling to get it working the first time, in my experience. I'll try to explain what it involves, to the best of my ability. Any suggestions or comments are welcome.
The following concepts are important:
A pool is a set of compute nodes (VMs) that do the actual work; a job is the container your tasks are added to and is tied to a specific pool; a task is a single command-line execution on a node, usually with one or more input files attached.
Your tasks are picked up one by one by the available nodes in the pool, and each node executes the command line that its task specifies. Available on the node are your executables and the file you assigned to the task, containing the data that identifies, in your case, which users should be processed by that task.
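The getting-started sample covers the setup in full, but as a rough sketch (pool id, job id, VM size, image and node count are placeholders you would tune to your workload), creating the pool and the job with the Batch .NET SDK looks something like this:

    using Microsoft.Azure.Batch;
    using Microsoft.Azure.Batch.Auth;

    // Placeholder credentials for your Batch account.
    var credentials = new BatchSharedKeyCredentials(
        "https://<account>.<region>.batch.azure.com", "<account>", "<key>");

    using (BatchClient batchClient = BatchClient.Open(credentials))
    {
        // A pool of Windows VMs that will run the tasks.
        CloudPool pool = batchClient.PoolOperations.CreatePool(
            poolId: "userexport-pool",
            virtualMachineSize: "Standard_D2_v3",
            virtualMachineConfiguration: new VirtualMachineConfiguration(
                new ImageReference(
                    publisher: "MicrosoftWindowsServer",
                    offer: "WindowsServer",
                    sku: "2016-datacenter-smalldisk",
                    version: "latest"),
                nodeAgentSkuId: "batch.node.windows amd64"),
            targetDedicatedComputeNodes: 10);
        await pool.CommitAsync();

        // A job bound to that pool; the tasks are added to it later.
        CloudJob job = batchClient.JobOperations.CreateJob();
        job.Id = "userexport-job";
        job.PoolInformation = new PoolInformation { PoolId = pool.Id };
        await job.CommitAsync();
    }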
So suppose that in your case you need to perform the processing for 100 users. Each individual processing job is an execution of some executable you create, e.g. ProcessUserData.exe. As an example, suppose that in addition to a file of user ids, your executable takes an argument specifying whether the processing should run against test or prod, e.g. ProcessUserData.exe "path to file containing user ids to process" --environment test. We'll assume that your executable needs no input other than the user ids and the environment in which to perform the processing.
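To make that concrete, here is a minimal sketch of what such a ProcessUserData.exe might look like; the input format (one user id per line) and the actual conversion step are assumptions:

    using System;
    using System.IO;

    class Program
    {
        static int Main(string[] args)
        {
            // args[0]: path to the file containing the user ids to process
            // remaining args: e.g. "--environment test"
            if (args.Length < 1)
            {
                Console.Error.WriteLine("Usage: ProcessUserData.exe <userIdFile> [--environment <env>]");
                return 1;
            }

            string environment = "prod";
            for (int i = 1; i < args.Length - 1; i++)
            {
                if (args[i] == "--environment") environment = args[i + 1];
            }

            foreach (string userId in File.ReadAllLines(args[0]))
            {
                if (string.IsNullOrWhiteSpace(userId)) continue;
                Console.WriteLine($"Processing user {userId} in {environment}");
                // Here you would read the user's blobs, convert them to PDF and upload the result.
            }

            return 0;
        }
    }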
When you upload the input files to the input container, you get back a reference (ResourceFile) to each of them. One input file should be associated with one task, and each task is passed to an available node as the job executes.
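A sketch of that upload step, based on the classic WindowsAzure.Storage SDK that the getting-started sample uses (the container name, local directory and SAS lifetime are placeholders):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using Microsoft.Azure.Batch;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    CloudStorageAccount storageAccount = CloudStorageAccount.Parse("<storage connection string>");
    CloudBlobContainer container = storageAccount.CreateCloudBlobClient()
                                                 .GetContainerReference("batch-input");
    await container.CreateIfNotExistsAsync();

    var resourceFiles = new List<ResourceFile>();
    foreach (string path in Directory.GetFiles("input", "userIds*.txt"))
    {
        string blobName = Path.GetFileName(path);
        CloudBlockBlob blob = container.GetBlockBlobReference(blobName);
        await blob.UploadFromFileAsync(path);

        // A read-only SAS lets the compute nodes download the blob without storage credentials.
        string sasToken = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read,
            SharedAccessExpiryTime = DateTime.UtcNow.AddDays(2)
        });

        // Newer Batch SDK versions expose ResourceFile.FromUrl; older ones use a ResourceFile constructor.
        resourceFiles.Add(ResourceFile.FromUrl($"{blob.Uri}{sasToken}", blobName));
    }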
The details of how to do this are clear from the getting started link, I'm trying to focus on the concepts, so I won't go into much detail.
You now create the tasks (CloudTask) to be executed, specify what each of them should run on the command line, and add them to the job. Here you reference the input file that each task should take as input. An example (assuming Windows cmd):
cmd /c %AZ_BATCH_NODE_SHARED_DIR%\ProcessUserData.exe %AZ_BATCH_TASK_WORKING_DIR%\userIds1.txt --environment test
Here, userIds1.txt is the filename of your first ResourceFile, returned when you uploaded the input files (resource files are downloaded into the task's working directory on the node). The next command will specify userIds2.txt, etc.
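Putting that together, a sketch of building the task list; resourceFiles is assumed to be the list produced by the upload sketch above, and the task ids are arbitrary:

    using System.Collections.Generic;
    using Microsoft.Azure.Batch;

    static class TaskBuilder
    {
        // Builds one CloudTask per input file; each task downloads its own input file.
        public static List<CloudTask> BuildTasks(IList<ResourceFile> resourceFiles)
        {
            var tasks = new List<CloudTask>();
            for (int i = 0; i < resourceFiles.Count; i++)
            {
                ResourceFile input = resourceFiles[i];

                // Resource files land in the task's working directory on the node.
                string commandLine =
                    $@"cmd /c %AZ_BATCH_NODE_SHARED_DIR%\ProcessUserData.exe %AZ_BATCH_TASK_WORKING_DIR%\{input.FilePath} --environment test";

                tasks.Add(new CloudTask($"processUsers{i}", commandLine)
                {
                    ResourceFiles = new List<ResourceFile> { input }
                });
            }
            return tasks;
        }
    }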
When you've created your list of CloudTask objects containing the commands, you add them to the job, e.g. in C#:
await batchClient.JobOperations.AddTaskAsync(jobId, tasks);
And now you wait for the job to finish.
What happens now is that Azure Batch looks at the nodes in the pool and, as long as there are tasks left in the task list, assigns a task to each available (idle) node.
Once the job has completed (which you can poll for through the API), you can delete the pool and the job, and you pay only for the compute you've used.
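A sketch of that wait-and-clean-up step, using the TaskStateMonitor from the .NET SDK (the 12-hour timeout is an arbitrary placeholder):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.Azure.Batch;
    using Microsoft.Azure.Batch.Common;

    static class JobMonitor
    {
        // Waits until every task in the job has completed, then removes the job and
        // the pool so no further compute time is billed.
        public static async Task WaitAndCleanUpAsync(BatchClient batchClient, string jobId, string poolId)
        {
            List<CloudTask> tasks = batchClient.JobOperations.ListTasks(jobId).ToList();

            // Blocks until all tasks reach the Completed state (or the timeout elapses).
            await batchClient.Utilities.CreateTaskStateMonitor()
                .WhenAll(tasks, TaskState.Completed, TimeSpan.FromHours(12));

            await batchClient.JobOperations.DeleteJobAsync(jobId);
            await batchClient.PoolOperations.DeletePoolAsync(poolId);
        }
    }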
One final note: your tasks may depend on external packages, i.e. an execution environment that is not installed by default on the OS you've selected. There are a few possible ways of resolving this: 1. Upload an application package, which will be distributed to each node as it enters the pool (again, there's an environment variable pointing to it); this can be done through the Azure portal. 2. Use a command-line tool to fetch what you need, e.g. apt-get install on Ubuntu.
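Both options can be set on the pool object before it is committed; a sketch (the application id/version and the apt-get package are placeholders, and the start task example assumes an Ubuntu pool rather than the Windows one above):

    using System.Collections.Generic;
    using Microsoft.Azure.Batch;
    using Microsoft.Azure.Batch.Common;

    static class PoolSetup
    {
        // Call this on the CloudPool before committing it (see the pool sketch earlier).
        public static void ConfigureDependencies(CloudPool pool)
        {
            // Option 1: an application package that Batch copies to every node joining the pool.
            pool.ApplicationPackageReferences = new List<ApplicationPackageReference>
            {
                new ApplicationPackageReference { ApplicationId = "processuserdata", Version = "1.0" }
            };

            // Option 2: a start task that installs dependencies as each node boots (Linux pool).
            pool.StartTask = new StartTask
            {
                CommandLine = "/bin/bash -c \"apt-get update && apt-get install -y libgdiplus\"",
                UserIdentity = new UserIdentity(new AutoUserSpecification(elevationLevel: ElevationLevel.Admin)),
                WaitForSuccess = true
            };
        }
    }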
Hope that gives you an overview of what Batch is. In my opinion the best way to get started is to do something very simple, i.e. print environment variables in a single task on a single node.
You can inspect the stdout and stderr of each node while the execution is underway, again through the portal.
There's obviously a lot more to it than this, but this is a basic guide. You can create linked tasks and a lot of other nifty things, but you can read up on that if you need it.
Upvotes: 3