Reputation: 543
We have an application that, over time, stores immense amounts of data for our users (we're talking hundreds of TB or more). Due to new EU directives, should a user decide to discontinue using our services, all their data must be available for export for the next 80 days, after which it MUST be eradicated completely. The data is stored in Azure Storage block blobs, and the metadata in an SQL database.
Sadly, the data cannot be exported as-is (it is in a proprietary format), so it would need to be processed and converted to PDF for export. A file is approximately 240 KB in size, so imagine the number of PDFs involved for the data volumes stated above.
We tried using Azure Functions to split the job into small chunks of 50 items each, but at some point it went haywire, spun out of control and created enormous costs.
So what we're looking for is this: does anyone know of a service that fits these requirements?
Upvotes: 2
Views: 2568
Reputation: 84
Since a lot of people are probably looking for a solution to this kind of requirement: the new release of ADLA (Azure Data Lake Analytics) now supports the Parquet format, usable from U-SQL. With less than 100 lines of code you can read the many small files into larger ones and, using relatively few resources (vertices), compress the data into Parquet files. For example, you can store 3 TB of data in 10,000 Parquet files. Reading those files back is also very simple, and you can produce CSV files from them in no time if required. This will certainly save you a lot of cost and time.
Upvotes: 0
Reputation: 367
Here's a getting-started link for .NET, Python, or Node.js: https://learn.microsoft.com/en-us/azure/batch/batch-dotnet-get-started
The concept in batch is pretty simple, although it takes a bit of fiddling to get it working the first time, in my experience. I'll try to explain what it involves, to the best of my ability. Any suggestions or comments are welcome.
The following concepts are important:
A pool is a set of compute nodes (VMs) that do the actual work; a job is the container your tasks are added to and is tied to a specific pool; a task is a single command-line execution on a node, usually with one or more input files attached.
Your tasks are picked up one by one by the available nodes in the pool, and each node executes the command line that its task specifies. Available on the node are your executables and the file you assigned to the task, containing the data that identifies, in your case, which users should be processed by that task.
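The getting-started sample covers the setup in full, but as a rough sketch (pool id, job id, VM size, image and node count are placeholders you would tune to your workload), creating the pool and the job with the Batch .NET SDK looks something like this:

    using Microsoft.Azure.Batch;
    using Microsoft.Azure.Batch.Auth;

    // Placeholder credentials for your Batch account.
    var credentials = new BatchSharedKeyCredentials(
        "https://<account>.<region>.batch.azure.com", "<account>", "<key>");

    using (BatchClient batchClient = BatchClient.Open(credentials))
    {
        // A pool of Windows VMs that will run the tasks.
        CloudPool pool = batchClient.PoolOperations.CreatePool(
            poolId: "userexport-pool",
            virtualMachineSize: "Standard_D2_v3",
            virtualMachineConfiguration: new VirtualMachineConfiguration(
                new ImageReference(
                    publisher: "MicrosoftWindowsServer",
                    offer: "WindowsServer",
                    sku: "2016-datacenter-smalldisk",
                    version: "latest"),
                nodeAgentSkuId: "batch.node.windows amd64"),
            targetDedicatedComputeNodes: 10);
        await pool.CommitAsync();

        // A job bound to that pool; the tasks are added to it later.
        CloudJob job = batchClient.JobOperations.CreateJob();
        job.Id = "userexport-job";
        job.PoolInformation = new PoolInformation { PoolId = pool.Id };
        await job.CommitAsync();
    }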
So suppose that in your case you need to perform the processing for 100 users. Each individual processing job is an execution of some executable you create, e.g. ProcessUserData.exe. As an example, suppose that in addition to a file of user ids, your executable takes an argument specifying whether the processing should run against test or prod, e.g. ProcessUserData.exe "path to file containing user ids to process" --environment test. We'll assume that your executable needs no input other than the user ids and the environment in which to perform the processing.
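To make that concrete, here is a minimal sketch of what such a ProcessUserData.exe might look like; the input format (one user id per line) and the actual conversion step are assumptions:

    using System;
    using System.IO;

    class Program
    {
        static int Main(string[] args)
        {
            // args[0]: path to the file containing the user ids to process
            // remaining args: e.g. "--environment test"
            if (args.Length < 1)
            {
                Console.Error.WriteLine("Usage: ProcessUserData.exe <userIdFile> [--environment <env>]");
                return 1;
            }

            string environment = "prod";
            for (int i = 1; i < args.Length - 1; i++)
            {
                if (args[i] == "--environment") environment = args[i + 1];
            }

            foreach (string userId in File.ReadAllLines(args[0]))
            {
                if (string.IsNullOrWhiteSpace(userId)) continue;
                Console.WriteLine($"Processing user {userId} in {environment}");
                // Here you would read the user's blobs, convert them to PDF and upload the result.
            }

            return 0;
        }
    }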
When you upload the input files to the input container, you get back a reference (ResourceFile) to each of them. One input file should be associated with one task, and each task is passed to an available node as the job executes.
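A sketch of that upload step, based on the classic WindowsAzure.Storage SDK that the getting-started sample uses (the container name, local directory and SAS lifetime are placeholders):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using Microsoft.Azure.Batch;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    CloudStorageAccount storageAccount = CloudStorageAccount.Parse("<storage connection string>");
    CloudBlobContainer container = storageAccount.CreateCloudBlobClient()
                                                 .GetContainerReference("batch-input");
    await container.CreateIfNotExistsAsync();

    var resourceFiles = new List<ResourceFile>();
    foreach (string path in Directory.GetFiles("input", "userIds*.txt"))
    {
        string blobName = Path.GetFileName(path);
        CloudBlockBlob blob = container.GetBlockBlobReference(blobName);
        await blob.UploadFromFileAsync(path);

        // A read-only SAS lets the compute nodes download the blob without storage credentials.
        string sasToken = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read,
            SharedAccessExpiryTime = DateTime.UtcNow.AddDays(2)
        });

        // Newer Batch SDK versions expose ResourceFile.FromUrl; older ones use a ResourceFile constructor.
        resourceFiles.Add(ResourceFile.FromUrl($"{blob.Uri}{sasToken}", blobName));
    }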
The details of how to do this are clear from the getting started link, I'm trying to focus on the concepts, so I won't go into much detail.
You now create the tasks (CloudTask) to be executed, specify what each of them should run on the command line, and add them to the job. Here you reference the input file that each task should take as input. An example (assuming Windows cmd):
cmd /c %AZ_BATCH_NODE_SHARED_DIR%\ProcessUserData.exe %AZ_BATCH_TASK_WORKING_DIR%\userIds1.txt --environment test
Here, userIds1.txt is the filename of your first ResourceFile, returned when you uploaded the input files (resource files are downloaded into the task's working directory on the node). The next command will specify userIds2.txt, etc.
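Putting that together, a sketch of building the task list; resourceFiles is assumed to be the list produced by the upload sketch above, and the task ids are arbitrary:

    using System.Collections.Generic;
    using Microsoft.Azure.Batch;

    static class TaskBuilder
    {
        // Builds one CloudTask per input file; each task downloads its own input file.
        public static List<CloudTask> BuildTasks(IList<ResourceFile> resourceFiles)
        {
            var tasks = new List<CloudTask>();
            for (int i = 0; i < resourceFiles.Count; i++)
            {
                ResourceFile input = resourceFiles[i];

                // Resource files land in the task's working directory on the node.
                string commandLine =
                    $@"cmd /c %AZ_BATCH_NODE_SHARED_DIR%\ProcessUserData.exe %AZ_BATCH_TASK_WORKING_DIR%\{input.FilePath} --environment test";

                tasks.Add(new CloudTask($"processUsers{i}", commandLine)
                {
                    ResourceFiles = new List<ResourceFile> { input }
                });
            }
            return tasks;
        }
    }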
When you've created your list of CloudTask objects containing the commands, you add them to the job, e.g. in C#:
await batchClient.JobOperations.AddTaskAsync(jobId, tasks);
And now you wait for the job to finish.
What happens now is that Azure Batch looks at the nodes in the pool and, as long as there are tasks left in the task list, assigns a task to each available (idle) node.
Once the job has completed (which you can poll for through the API), you can delete the pool and the job, and you pay only for the compute you've used.
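A sketch of that wait-and-clean-up step, using the TaskStateMonitor from the .NET SDK (the 12-hour timeout is an arbitrary placeholder):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.Azure.Batch;
    using Microsoft.Azure.Batch.Common;

    static class JobMonitor
    {
        // Waits until every task in the job has completed, then removes the job and
        // the pool so no further compute time is billed.
        public static async Task WaitAndCleanUpAsync(BatchClient batchClient, string jobId, string poolId)
        {
            List<CloudTask> tasks = batchClient.JobOperations.ListTasks(jobId).ToList();

            // Blocks until all tasks reach the Completed state (or the timeout elapses).
            await batchClient.Utilities.CreateTaskStateMonitor()
                .WhenAll(tasks, TaskState.Completed, TimeSpan.FromHours(12));

            await batchClient.JobOperations.DeleteJobAsync(jobId);
            await batchClient.PoolOperations.DeletePoolAsync(poolId);
        }
    }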
One final note: your tasks may depend on external packages, i.e. an execution environment that is not installed by default on the OS you've selected. There are a few possible ways of resolving this: 1. Upload an application package, which will be distributed to each node as it enters the pool (again, there's an environment variable pointing to it); this can be done through the Azure portal. 2. Use a command-line tool to fetch what you need, e.g. apt-get install on Ubuntu.
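Both options can be set on the pool object before it is committed; a sketch (the application id/version and the apt-get package are placeholders, and the start task example assumes an Ubuntu pool rather than the Windows one above):

    using System.Collections.Generic;
    using Microsoft.Azure.Batch;
    using Microsoft.Azure.Batch.Common;

    static class PoolSetup
    {
        // Call this on the CloudPool before committing it (see the pool sketch earlier).
        public static void ConfigureDependencies(CloudPool pool)
        {
            // Option 1: an application package that Batch copies to every node joining the pool.
            pool.ApplicationPackageReferences = new List<ApplicationPackageReference>
            {
                new ApplicationPackageReference { ApplicationId = "processuserdata", Version = "1.0" }
            };

            // Option 2: a start task that installs dependencies as each node boots (Linux pool).
            pool.StartTask = new StartTask
            {
                CommandLine = "/bin/bash -c \"apt-get update && apt-get install -y libgdiplus\"",
                UserIdentity = new UserIdentity(new AutoUserSpecification(elevationLevel: ElevationLevel.Admin)),
                WaitForSuccess = true
            };
        }
    }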
Hope that gives you an overview of what Batch is. In my opinion the best way to get started is to do something very simple, i.e. print environment variables in a single task on a single node.
You can inspect the stdout and stderr of each node while the execution is underway, again through the portal.
There's obviously a lot more to it than this, but this is a basic guide. You can create linked tasks and a lot of other nifty things, but you can read up on that if you need it.
Upvotes: 3