Fabian Hertwig

Reputation: 1123

AzureML create dataset from datastore with multiple files - path not valid

I am trying to create a dataset in Azure ML where the data source is multiple files (e.g. images) in Blob Storage. How do you do that correctly?

Here is the error I get when following the documented approach in the UI:

When I create the dataset in the UI and select the blob storage and directory with either just dirname or dirname/**, the files cannot be found in the Explore tab, with the error ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet from the Consume tab, I get the following error:

from azureml.core import Workspace, Dataset

# set these to your own values
subscription_id = '<subscription-id>'
resource_group = '<resource-group>'
workspace_name = '<workspace-name>'

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)
Error message:

ScriptExecutionException was caused by StreamAccessException.
  StreamAccessException was caused by NotFoundException.
    Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'

When I select just one of the files instead of dirname or dirname/**, everything works. Does Azure ML actually support datasets consisting of multiple files?

Here is my setup:

I have a storage account with one container, data. Inside it is a directory testdata containing testfile1.txt and testfile2.txt.

In Azure ML I created a datastore testdatastore and selected the data container of that storage account for it.

Then in Azure ML I create a dataset from the datastore: I select "File" as the dataset type and the datastore above, browse the files, select a folder, and tick the option to include files in subdirectories. This creates the path testdata/**, which does not work as described above.

I get the same issue when creating the datastore and dataset in Python:

import azureml.core
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()

datastore = Datastore(ws, "mydatastore")

# point the dataset at the whole testdata folder in the datastore
datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")

Upvotes: 0

Views: 5710

Answers (2)

Fabian Hertwig

Reputation: 1123

I uploaded and registered the files with this script and everything works as expected.

from azureml.core import Datastore, Dataset, Workspace

import logging

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

datastore_name = "mydatastore"
dataset_path_on_disk = "./data/images_greyscale"
dataset_path_in_datastore = "images_greyscale"

azure_dataset_name = "images_grayscale"
azure_dataset_description = "dataset transformed into the coco format and into grayscale images"


workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name=datastore_name)

logger.info("Uploading data...")
datastore.upload(
    src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
)
logger.info("Uploading data done.")

logger.info("Registering dataset...")
datastore_path = [(datastore, dataset_path_in_datastore)]
dataset = Dataset.File.from_files(path=datastore_path)
dataset.register(
    workspace=workspace,
    name=azure_dataset_name,
    description=azure_dataset_description,
    create_new_version=True,
)
logger.info("Registering dataset done.")

Upvotes: 1

Andrei Liakhovich

Reputation: 71

Datasets definitely support multiple files, so your problem is almost certainly in the permissions given when creating the "mydatastore" datastore (I suspect you used a SAS token to create it). In order to access anything but individual files, you need to give the datastore list permissions. This is not a problem if you register the datastore with an account key, but it can be a limitation of a SAS token. The second part of the error message, "the provided path is not valid or the files could not be accessed", refers to such potential permission issues. You can also verify that the folder/** syntax works by creating a dataset from the default blob datastore that was provisioned with your ML workspace.
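
For example, a minimal sketch of registering the datastore with an account key instead of a SAS token (all resource names and the key below are placeholders for your own values):

from azureml.core import Datastore, Workspace

ws = Workspace.from_config()

# Registering with the storage account key grants full permissions,
# including list, on the container.
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="mydatastore",
    container_name="data",
    account_name="mystoragename",
    account_key="<storage-account-key>",
)

# Alternatively, test the folder/** syntax against the default datastore
# provisioned with the workspace:
default_datastore = ws.get_default_datastore()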

Upvotes: 2
