Arnab Biswas

Reputation: 4605

AzureML: TabularDataset.to_pandas_dataframe() hangs when parquet file is empty

I have created a TabularDataset using the Azure ML Python API. The data in question is a set of parquet files (~10K files, each about 330 KB) residing in Azure Data Lake Gen 2 and spread across multiple partitions. When I try to load the dataset with TabularDataset.to_pandas_dataframe(), the call never returns (it hangs) if the dataset includes empty parquet files. If the dataset does not include those empty files, TabularDataset.to_pandas_dataframe() completes within a few minutes.

By empty parquet file, I mean that if I read the individual file using pandas (pd.read_parquet()), the result is an empty DataFrame (df.empty == True).
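To make the "empty" check above concrete, here is a minimal sketch of how the files could be split into empty and non-empty ones using the same df.empty test. The helper name `split_empty_parquet` and its `reader` parameter are my own illustration and not part of the Azure ML API; the sketch also assumes the paths can be read directly by pandas.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def split_empty_parquet(
    paths: Iterable[str],
    reader: Optional[Callable[[str], object]] = None,
) -> Tuple[List[str], List[str]]:
    """Return (non_empty, empty) lists of paths, using the df.empty check.

    `reader` is a hypothetical hook for illustration: any callable that
    takes a path and returns an object with an `.empty` attribute.
    """
    if reader is None:
        # Default to pandas; this needs a parquet engine such as pyarrow.
        import pandas as pd

        reader = pd.read_parquet

    non_empty: List[str] = []
    empty: List[str] = []
    for path in paths:
        df = reader(path)
        # A file with columns but zero rows yields df.empty == True.
        (empty if df.empty else non_empty).append(path)
    return non_empty, empty
```

With roughly 10K files this reads every file once, so it is slow, but it gives the list of files that would have to be left out of the dataset for to_pandas_dataframe() to complete.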

I discovered the root cause while working on another issue mentioned [here][1].

My question is: how can I make TabularDataset.to_pandas_dataframe() work even when there are empty parquet files?

Update: The issue has been fixed in the following version:

Upvotes: 1

Views: 779

Answers (1)

Andrei Liakhovich

Reputation: 71

Thanks for reporting it. This is a bug in the handling of parquet files that have columns but an empty row set. It has already been fixed, and the fix will be included in the next release.

I could not repro the hang on multiple files, though, so it would be nice if you could provide more info on that.

Upvotes: 1
