Reputation: 4605
I have created a Tabular Dataset using the Azure ML Python API. The data in question is a set of parquet files (~10K files of about 330 KB each) residing in Azure Data Lake Gen 2, spread across multiple partitions. When I try to load the dataset using TabularDataset.to_pandas_dataframe(), it runs forever (hangs) if the dataset includes empty parquet files. If the dataset doesn't include those empty parquet files, TabularDataset.to_pandas_dataframe() completes within a few minutes.
By empty parquet file, I mean that if I read the individual parquet file using pandas (pd.read_parquet()), the result is an empty DataFrame (df.empty == True).
I discovered the root cause while working on another issue mentioned [here][1].
My question is: how can I make TabularDataset.to_pandas_dataframe() work even when there are empty parquet files?
Update: The issue has been fixed in the following version:
Upvotes: 1
Views: 779
Reputation: 71
Thanks for reporting it. This is a bug in the handling of parquet files that have columns but an empty row set. It has already been fixed, and the fix will be included in the next release.
I could not reproduce the hang with multiple files, though, so it would be helpful if you could provide more information on that.
Upvotes: 1