Reputation: 667
I have a data folder containing multiple .parquet files; these files were converted from .csv using pandas and pyarrow. All files have a datetime index named 'Timestamp' and five columns named 'Open', 'High', 'Low', 'Close', 'Volume', all with the same dtype=int32.
I just want to load them all into a single Dask DataFrame. Here's a code snippet.
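For reference, the conversion was done with something along these lines (a simplified sketch, not the exact script; 'prices.csv' is a placeholder file name):
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.read_csv('prices.csv', parse_dates=['Timestamp'], index_col='Timestamp')
df = df.astype('int32')  # 'Open', 'High', 'Low', 'Close', 'Volume' all int32
pq.write_table(pa.Table.from_pandas(df), 'prices.parquet')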
import os
import pyarrow as pa
import pyarrow.parquet as pq
import dask.dataframe as dd
user = os.getlogin()
data_path = 'C:\\Users\\%s\\data\\' % user
ds = dd.read_parquet(data_path) # error
ds = dd.read_parquet(data_path, index='Timestamp') # error
But doing so returns the error 'fastparquet.util.ParquetException: Metadata parse failed: # data_path'.
So I tried to read the individual files' metadata manually.
import glob
files = glob.glob(data_path + '*.parquet')
for file in files:
    print(pq.read_metadata(file))  # no error returned
ds = dd.read_parquet(files) # error
ds = dd.read_parquet(files, index='Timestamp') # error
What's wrong?
Upvotes: 2
Views: 2132
Reputation: 28684
To read the data with pyarrow and not fastparquet, you will need
ds = dd.read_parquet(data_path, engine='pyarrow')
which should work, since the data was written by arrow.
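For example, keeping the 'Timestamp' index and checking the result (a sketch, assuming the pyarrow engine reads these files cleanly):
ds = dd.read_parquet(data_path, engine='pyarrow', index='Timestamp')
print(ds.dtypes)  # should show int32 for the five columns
print(ds.head())  # only reads the first partition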
That the data does not load with fastparquet is concerning and probably a bug. I note that you are using Windows paths, so that may be the problem. I would encourage you to try https://github.com/dask/fastparquet/pull/232 to see if it fixes things for you.
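If the backslashes in the Windows path are indeed what trips fastparquet up, one thing worth trying in the meantime (a guess, not a confirmed workaround) is to pass forward-slash paths, e.g. built with pathlib:
from pathlib import Path
files = [p.as_posix() for p in Path(data_path).glob('*.parquet')]
ds = dd.read_parquet(files, index='Timestamp')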
Upvotes: 3