Fabian Hertwig

Reputation: 1123

AzureML create dataset from datastore with multiple files - path not valid

I am trying to create a dataset in Azure ML where the data source is multiple files (e.g. images) in Blob Storage. How do you do that correctly?

Here is the error I get when following the documented approach in the UI:

When I create the dataset in the UI and select the blob storage and directory with either just dirname or dirname/**, the files cannot be found in the Explore tab, with the error ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet from the Consume tab, I get the following error:

from azureml.core import Workspace, Dataset

# set these to your own values
subscription_id = '<subscription-id>'
resource_group = '<resource-group>'
workspace_name = '<workspace-name>'

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)
Error message:

ScriptExecutionException was caused by StreamAccessException.
  StreamAccessException was caused by NotFoundException.
    Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'

When I select just one of the files instead of dirname or dirname/**, everything works. Does Azure ML actually support datasets consisting of multiple files?

Here is my setup:

I have a storage account with one container, data. Inside it is a directory testdata containing testfile1.txt and testfile2.txt.

In Azure ML I created a datastore testdatastore and selected the data container of that storage account for it.

Then in Azure ML I create a dataset from the datastore: I select "File" as the dataset type and the datastore above, browse the files, select a folder, and tick the option to include files in subdirectories. This creates the path testdata/**, which does not work as described above.

I get the same issue when creating the datastore and dataset in Python:

import azureml.core
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()

datastore = Datastore(ws, "mydatastore")

# point the dataset at the whole testdata folder in the datastore
datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")

Upvotes: 0

Views: 5710

Answers (2)

Fabian Hertwig

Reputation: 1123

I uploaded and registered the files with this script and everything works as expected.

from azureml.core import Datastore, Dataset, Workspace

import logging

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

datastore_name = "mydatastore"
dataset_path_on_disk = "./data/images_greyscale"
dataset_path_in_datastore = "images_greyscale"

azure_dataset_name = "images_grayscale"
azure_dataset_description = "dataset transformed into the coco format and into grayscale images"


workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name=datastore_name)

logger.info("Uploading data...")
datastore.upload(
    src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
)
logger.info("Uploading data done.")

logger.info("Registering dataset...")
datastore_path = [(datastore, dataset_path_in_datastore)]
dataset = Dataset.File.from_files(path=datastore_path)
dataset.register(
    workspace=workspace,
    name=azure_dataset_name,
    description=azure_dataset_description,
    create_new_version=True,
)
logger.info("Registering dataset done.")

Upvotes: 1

Andrei Liakhovich

Reputation: 71

Datasets definitely support multiple files, so your problem is almost certainly in the permissions given when creating the "mydatastore" datastore (I suspect you used a SAS token to create it). In order to access anything but individual files, you need to give the datastore list permissions. This is not a problem if you register the datastore with an account key, but it can be a limitation of a SAS token. The second part of the error message, "the provided path is not valid or the files could not be accessed", refers to such potential permission issues. You can also verify that the folder/** syntax works by creating a dataset from the default blob datastore that was provisioned with your ML workspace.
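
For example, a minimal sketch of registering the datastore with an account key instead of a SAS token (all resource names and the key below are placeholders for your own values):

from azureml.core import Datastore, Workspace

ws = Workspace.from_config()

# Registering with the storage account key grants full permissions,
# including list, on the container.
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="mydatastore",
    container_name="data",
    account_name="mystoragename",
    account_key="<storage-account-key>",
)

# Alternatively, test the folder/** syntax against the default datastore
# provisioned with the workspace:
default_datastore = ws.get_default_datastore()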

Upvotes: 2
