Reputation: 25
Azure ML fails to read a tabular dataset from a folder containing many Parquet files.
Creating the dataset:
from azureml.core import Dataset
from azureml.data.datapath import DataPath
# `datastore` is an already-registered Datastore object
datastore_path = [DataPath(datastore, 'churn')]
tabular_dataset = Dataset.Tabular.from_parquet_files(path=datastore_path)
Upvotes: 2
Views: 2276
Reputation: 2754
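Add the *.parquet extension pattern to the path so that only the Parquet files in the folder are matched:
from azureml.core import Dataset
from azureml.data.datapath import DataPath
# Glob pattern instead of the bare folder name
datastore_path = [DataPath(datastore, 'churn/*.parquet')]
tabular_dataset = Dataset.Tabular.from_parquet_files(path=datastore_path)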
There are also ways to avoid reading all the data into memory at once. You can use skip() and take() on the TabularDataset to request only a portion of the source data at a time. Alternatively, you can mount the Parquet files as a FileDataset and then construct a separate TabularDataset for each subset of the files in your training script.
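The skip()/take() approach could look roughly like the sketch below. The helper name iter_batches and the batch size are my own choices for illustration, not part of the SDK; the only SDK calls it relies on are TabularDataset.skip(count), TabularDataset.take(count) and to_pandas_dataframe(). Whether this actually reduces the work done on the service side for your data is something you would have to measure.

```python
from typing import Any, Iterator


def iter_batches(dataset: Any, batch_size: int) -> Iterator[Any]:
    """Yield consecutive batches of at most `batch_size` rows.

    `dataset` is assumed to behave like an azureml TabularDataset, i.e. it
    provides skip(count), take(count) and to_pandas_dataframe().
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    offset = 0
    while True:
        # Materialise only the current window of rows.
        frame = dataset.skip(offset).take(batch_size).to_pandas_dataframe()
        if len(frame) == 0:
            break  # nothing left to read
        yield frame
        if len(frame) < batch_size:
            break  # short batch means the end of the data was reached
        offset += batch_size


# Hypothetical usage with the dataset created above:
# for frame in iter_batches(tabular_dataset, batch_size=100_000):
#     handle(frame)  # handle() is a placeholder for your own processing
```

Each iteration keeps only one batch in memory, so the batch size is the knob to tune against the memory available on your compute.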
Here’s a sample notebook for your reference: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb
Upvotes: 2