theSekyi

Reputation: 530

Dask read_parquet adds an extra column dir0

I have multiple parquet files in different directories:

paths = ['adl://entrofi/shift/20190725_060500_20190928_060500/*.parquet',
'adl://entrofi/shift/20190726_060500_20190928_060500/*.parquet',
'adl://entrofi/shift/20190727_060500_20190928_060500/*.parquet',
'adl://entrofi/shift/20190728_060500_20190928_060500/*.parquet',
'adl://entrofi/shift/20190820_060500_20190920_060500/*.parquet',
'adl://entrofi/shift/20190828_060500_20190928_060500/*.parquet']

Each file contains columns A, B, C.

I want to read all of these files, so I do:

import dask.dataframe as dd

ddf = dd.read_parquet(paths).drop_duplicates()

However, ddf contains columns A, B, C, and dir0. dir0 contains the names of the folders from which each path in paths was read.

Reading each individual file in paths yields no dir0 column.

How do I prevent dir0 from being automatically added to my ddf?

Upvotes: 2

Views: 408

Answers (1)

mdurant

Reputation: 28684

This is the expected behaviour with the fastparquet backend, because it looks like your files are partitioned by folder name, in this case using the "drill" scheme (as opposed to field=value directory names).

To avoid it, you could use the pyarrow engine, or simply specify the columns that you would like to keep:

# Option 1: select only the data columns, so the inferred partition is dropped
ddf = dd.read_parquet(paths, columns=['A', 'B', 'C'])

# Option 2: use the pyarrow engine, which does not infer drill-style partitions
ddf = dd.read_parquet(paths, engine='pyarrow')
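
For reference, here is a minimal local sketch of the inference itself, assuming a dask/fastparquet combination matching the question (the shift/ directory, the run_a/run_b folder names, and the sample data are all made up for illustration): parquet files sitting at the same depth under differently named directories pick up an inferred dir0 column under the drill scheme.

import os
import pandas as pd
import dask.dataframe as dd

# Write identical files into two differently named sibling directories
# (hypothetical names; any non field=value names trigger the drill scheme)
for name in ['run_a', 'run_b']:
    os.makedirs(f'shift/{name}', exist_ok=True)
    df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
    df.to_parquet(f'shift/{name}/part.parquet', engine='fastparquet')

# fastparquet treats the folder level as a partition and adds dir0
ddf = dd.read_parquet('shift/*/part.parquet', engine='fastparquet')
print(ddf.columns)  # A, B, C, dir0 -- dir0 holds 'run_a' / 'run_b'

# Restricting the columns (or switching to engine='pyarrow') avoids it
ddf = dd.read_parquet('shift/*/part.parquet', columns=['A', 'B', 'C'])
print(ddf.columns)  # A, B, C

A third option, if you want to keep reading with fastparquet and the full column list, is to drop the inferred column afterwards with ddf.drop(columns='dir0').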

Upvotes: 4
