ilpomo

Reputation: 667

Python Dask 'Metadata parse failed' error when trying to read_parquet

I have a data folder containing multiple .parquet files; these files were .csv files converted using pandas and pyarrow. All files have a datetime index named 'Timestamp' and five columns named 'Open', 'High', 'Low', 'Close', 'Volume', all with the same dtype=int32.
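
For reference, the conversion was done roughly like this ('prices.csv' and 'prices.parquet' are placeholder names for illustration):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# parse the 'Timestamp' column as datetime and use it as the index
df = pd.read_csv('prices.csv', parse_dates=['Timestamp'], index_col='Timestamp')

# cast the five price/volume columns to int32, matching the final files
df = df.astype('int32')

# write to parquet via pyarrow, keeping the index in the file
pq.write_table(pa.Table.from_pandas(df), 'prices.parquet')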

I just want to load them all into a single Dask DataFrame. Here's a code snippet.

import os
import pyarrow as pa
import pyarrow.parquet as pq
import dask.dataframe as dd


user = os.getlogin()
data_path = 'C:\\Users\\%s\\data\\' % user

ds = dd.read_parquet(data_path)  # error
ds = dd.read_parquet(data_path, index='Timestamp')  # error

But doing so returns the error 'fastparquet.util.ParquetException: Metadata parse failed: # data_path'

So I tried to access each file's metadata manually.

import glob

files = glob.glob(data_path + '*.parquet')

for file in files:
    print(pq.read_metadata(file))  # no error returned

ds = dd.read_parquet(files)  # error
ds = dd.read_parquet(files, index='Timestamp')  # error

What's wrong?

Upvotes: 2

Views: 2132

Answers (1)

mdurant

Reputation: 28684

To read the data with arrow rather than fastparquet, you will need

ds = dd.read_parquet(data_path, engine='pyarrow')

which should work, since the data was written by arrow.
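
For example, a sketch using the data_path and the 'Timestamp' index from the question:

import dask.dataframe as dd

# point read_parquet at the folder and let pyarrow handle the metadata
ds = dd.read_parquet(data_path, engine='pyarrow', index='Timestamp')
print(ds.head())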

That the data does not load with fastparquet is concerning and probably a bug. I note that you are using Windows paths, so that may be the problem. I would encourage you to try https://github.com/dask/fastparquet/pull/232 to see if it fixes things for you.
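
In the meantime, a possible workaround (an untested sketch, assuming the backslash separators really are the problem) is to hand fastparquet forward-slash paths, which Python on Windows accepts:

import glob

# replace backslashes with forward slashes before passing the paths on
files = [f.replace('\\', '/') for f in glob.glob(data_path + '*.parquet')]
ds = dd.read_parquet(files, engine='fastparquet')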

Upvotes: 3
