Failure reading parquet files

Question

Azure ML fails to read a tabular dataset from Parquet files when the folder contains many Parquet files.

Creating datasets

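from azureml.core import Dataset
from azureml.data.datapath import DataPath

# `datastore` is a Datastore object already registered in the workspace
datastore_path = [DataPath(datastore, 'churn')]
tabular_dataset = Dataset.Tabular.from_parquet_files(path=datastore_path)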

Ram · Accepted Answer

Add the *.parquet glob pattern to the path so that only the Parquet files in the folder are matched:

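from azureml.core import Dataset
from azureml.data.datapath import DataPath

# The glob restricts the dataset to the .parquet files under 'churn'
datastore_path = [DataPath(datastore, 'churn/*.parquet')]
tabular_dataset = Dataset.Tabular.from_parquet_files(path=datastore_path)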

To avoid reading all of the data into memory at once, you can also call skip() and take() on the TabularDataset to request only a portion of the source data at a time. Alternatively, mount the Parquet files as a FileDataset and then construct a separate TabularDataset for each subset of the files in your training script.
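As a rough illustration of the skip()/take() suggestion, here is a minimal sketch of reading a dataset in fixed-size chunks. The helper name iter_chunks and the chunk size are my own assumptions, not part of the Azure ML SDK. It relies only on the skip(), take() and to_pandas_dataframe() methods of TabularDataset, and I have not measured how it performs on large folders.

```python
def iter_chunks(dataset, chunk_size):
    """Yield successive chunks of at most `chunk_size` rows.

    `dataset` is assumed to behave like an azureml TabularDataset:
    skip(n) and take(n) return a new dataset, and
    to_pandas_dataframe() materialises the rows.
    iter_chunks itself is a hypothetical helper, not an SDK function.
    """
    offset = 0
    while True:
        # Only call skip() once there is something to skip.
        view = dataset if offset == 0 else dataset.skip(offset)
        chunk = view.take(chunk_size).to_pandas_dataframe()
        if len(chunk) == 0:
            break  # nothing left to read
        yield chunk
        if len(chunk) < chunk_size:
            break  # short chunk means this was the last one
        offset += chunk_size


# Possible usage with the dataset from the answer above
# (process() stands in for your own per-chunk logic):
#
# for df in iter_chunks(tabular_dataset, 100_000):
#     process(df)
```

Only one chunk is held as a DataFrame at a time, so peak memory should be bounded by the chunk size rather than by the size of the whole dataset.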

Here’s a sample notebook for your reference: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb
