Lukas Lötters

Reputation: 61

Clean Up Azure Machine Learning Blob Storage

I manage a frequently used Azure Machine Learning workspace with several experiments and active pipelines. Everything is working well so far. My problem is getting rid of old data from runs, experiments, and pipelines. Over the last year the blob storage has grown to an enormous size, because the data of every pipeline run is stored.

I have deleted older runs from experiments using the GUI, but the actual pipeline data on the blob storage is not deleted. Is there a smart way to clean up data on the blob storage for runs which have been deleted?

On one of the countless Microsoft support pages, I found the following not very helpful post:

*Azure does not automatically delete intermediate data written with OutputFileDatasetConfig. To avoid storage charges for large amounts of unneeded data, you should either:

  1. Programmatically delete intermediate data at the end of a pipeline run, when it is no longer needed
  2. Use blob storage with a short-term storage policy for intermediate data (see Optimize costs by automating Azure Blob Storage access tiers)
  3. Regularly review and delete no-longer-needed data*

https://learn.microsoft.com/en-us/azure/machine-learning/how-to-move-data-in-out-of-pipelines#delete-outputfiledatasetconfig-contents-when-no-longer-needed
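
If I understand option 1 correctly, I could delete the run outputs myself with the azure-storage-blob SDK, roughly like this (just a sketch; the connection string, container name and prefix are guesses for my setup, not values Azure ML hands you automatically):

from azure.storage.blob import ContainerClient

# Sketch: remove intermediate pipeline output under a given prefix.
# Connection string, container and prefix are placeholders for my setup.
container = ContainerClient.from_connection_string(
    conn_str="<storage-account-connection-string>",
    container_name="azureml",
)

for blob in container.list_blobs(name_starts_with="ExperimentRun/"):
    container.delete_blob(blob)

But that feels like working around the problem rather than solving it.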

Any idea is welcome.

Upvotes: 4

Views: 1707

Answers (2)

Nema Sobhani

Reputation: 51

Currently facing this exact problem. The most sensible approach is to enforce retention schedules at the storage account level. These are the steps you can follow:

  • Identify which storage account is linked to your AML instance and pull it up in the Azure portal.
  • Under Settings / Configuration, ensure you are using StorageV2 (which has the desired functionality).
  • Under Data management / Lifecycle management, create a new rule that targets your problem containers.

NOTE - I do not recommend a blanket enforcement policy against the entire storage account, because any registered datasets, models, compute info, notebooks, etc. will all be targeted for deletion as well. Instead, use the prefix arguments to declare relevant paths such as: storageaccount1234 / azureml / ExperimentRun

Here is the documentation on Lifecycle management:
https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview?tabs=azure-portal
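
If you would rather script this than click through the portal, roughly the same rule can be created with the azure-mgmt-storage SDK (a sketch only; the subscription, resource group, account name, prefix and retention period are placeholders you need to adapt):

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Lifecycle rule scoped to run output under the azureml container, so
# registered datasets, models, notebooks, etc. are left untouched.
rule = {
    "name": "cleanup-aml-run-data",
    "enabled": True,
    "type": "Lifecycle",
    "definition": {
        "filters": {
            "blob_types": ["blockBlob"],
            "prefix_match": ["azureml/ExperimentRun"],
        },
        "actions": {
            "base_blob": {"delete": {"days_after_modification_greater_than": 90}},
        },
    },
}

client.management_policies.create_or_update(
    "<resource-group>", "<storage-account-name>", "default", {"policy": {"rules": [rule]}}
)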

Upvotes: 0

frictionlesspulley

Reputation: 12368

Have you tried applying an Azure storage account management policy on the said storage account?

You could either change the tier of the blobs from hot -> cool -> archive and thereby reduce costs, or even configure an auto-delete policy after a set number of days.

Reference : https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview#sample-rule

If you use Terraform to manage your resources, this should be available as well.

Reference : https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_management_policy

resource "azurerm_storage_management_policy" "example" {
  storage_account_id = "<azureml-storage-account-id>"
  rule {
    name    = "rule2"
    enabled = false
    filters {
      prefix_match = ["pipeline"]
    }
    actions {
      base_blob {
        delete_after_days_since_modification_greater_than          = 90
      }
    }
  }
}

A similar option is available via the portal settings as well. Hope this helps!

Upvotes: 0
