Reputation: 530
I have multiple parquet files in different directories:
paths = ['adl://entrofi/shift/20190725_060500_20190928_060500/*.parquet',
'adl://entrofi/shift/20190726_060500_20190928_060500/*.parquet',
'adl://entrofi/shift/20190727_060500_20190928_060500/*.parquet',
'adl://entrofi/shift/20190728_060500_20190928_060500/*.parquet',
'adl://entrofi/shift/20190820_060500_20190920_060500/*.parquet',
'adl://entrofi/shift/20190828_060500_20190928_060500/*.parquet']
Each file contains columns A, B, C.
I want to read all of these files, so I do:
ddf = dd.read_parquet(paths).drop_duplicates()
However, ddf contains columns A, B, C and dir0. The dir0 column contains the names of the folders from which each path in paths was read. Reading any individual file in paths produces no dir0 column.
How do I avoid dir0 being added automatically to my ddf?
Upvotes: 2
Views: 408
Reputation: 28684
This is the expected behaviour with the fastparquet backend, because it looks like your files are partitioned by folder name, in this case using the "drill" scheme (as opposed to field=value directory names).
To avoid it, you could use the pyarrow engine, or simply specify the columns that you would like to keep:
# Keep only the original columns; the inferred partition column is never created
ddf = dd.read_parquet(paths, columns=['A', 'B', 'C'])

# Or use the pyarrow engine, which does not infer drill-style partitions here
ddf = dd.read_parquet(paths, engine='pyarrow')
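
For completeness, a minimal sketch combining either option with the deduplication from the question; the post-read drop at the end is my own assumption of a third alternative, not something the engines require, and it assumes dir0 is the only extra column added:

import dask.dataframe as dd

# paths is the list of ADL globs from the question
ddf = dd.read_parquet(paths, columns=['A', 'B', 'C']).drop_duplicates()

# Alternatively, read as before and drop the inferred column afterwards
ddf = dd.read_parquet(paths).drop(columns='dir0').drop_duplicates()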
Upvotes: 4