houseofleft

Reputation: 447

How to read/access custom parquet metadata saved with Dask

I'm currently using dask to save a parquet file, and I'd like to store some additional information about the file within the parquet file's metadata (as in, in the footer metadata, rather than in a global _metadata file).

Dask has a handy-looking "custom_metadata" parameter which takes a dictionary, and which I think I can use like this:

import dask.dataframe as dd
import pandas as pd

df = dd.from_pandas(pd.DataFrame({'a':[1, 2], 'b':[3, 4]}), npartitions=2)
df.to_parquet('parquet_folder', custom_metadata={'something': 'something_else'})

Only problem is, I can't figure out how to actually access the metadata. I thought the following would work:

import pyarrow.parquet

metadata = pyarrow.parquet.read_metadata('parquet_folder/part.0.parquet')

Only thing is, looking at the metadata output (both as itself, and through metadata.to_dict()), it doesn't seem to contain the key/values I saved earlier.

Any idea how I can access the custom saved metadata? (I'm pretty keen to keep using dask for saving the file, but not too fussed on how I access the metadata itself)

Upvotes: 0

Views: 838

Answers (2)

joris

Reputation: 139162

Using pyarrow, the custom key-value metadata is available in the metadata property of the FileMetaData object returned by read_metadata:

import pyarrow.parquet as pq

metadata = pq.read_metadata('parquet_folder/part.0.parquet')
metadata.metadata  # dict with bytes keys and bytes values
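
Since that dict holds bytes rather than strings, a small helper for decoding it can be convenient. This is only a minimal sketch: the function name decode_kv_metadata is made up for illustration, and the sample dict below is a hand-written stand-in for what metadata.metadata might return, not real output.

```python
def decode_kv_metadata(raw):
    """Turn a bytes-to-bytes mapping into a str-to-str dict."""
    if raw is None:
        # Treat a missing key-value section as an empty mapping.
        return {}
    return {key.decode("utf-8"): value.decode("utf-8") for key, value in raw.items()}


# Hypothetical stand-in for `metadata.metadata` (assumed shape, not real output).
raw = {b"something": b"something_else"}

decoded = decode_kv_metadata(raw)
print(decoded["something"])
```

With a real file, you would pass metadata.metadata in place of the raw dict.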

Upvotes: 1

mdurant

Reputation: 28683

I cannot immediately see how to do it with pyarrow, but with fastparquet (which I know much better), it is:

import fastparquet

pf = fastparquet.ParquetFile("parquet_folder/")
pf.key_value_metadata

Notice the ugly encoded Arrow schema in there, in addition to the original pandas one (and parquet itself has a required schema object too).
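
If you only want your own keys back, one option is to filter out the schema entries. The sketch below assumes the mapping has string keys and that the schema entries are stored under "pandas" and "ARROW:schema"; the split_metadata helper and the sample dict are made up for illustration and are not real output.

```python
import json

# Keys assumed to hold the pandas and Arrow schema entries.
SCHEMA_KEYS = {"pandas", "ARROW:schema"}


def split_metadata(kv):
    """Return (custom entries, parsed pandas metadata or None)."""
    custom = {key: value for key, value in kv.items() if key not in SCHEMA_KEYS}
    # The pandas entry is a JSON string, so it can be parsed if present.
    pandas_meta = json.loads(kv["pandas"]) if "pandas" in kv else None
    return custom, pandas_meta


# Hypothetical stand-in for `pf.key_value_metadata` (assumed shape).
kv = {
    "pandas": '{"columns": [], "index_columns": []}',
    "ARROW:schema": "<base64 blob>",
    "something": "something_else",
}

custom, pandas_meta = split_metadata(kv)
print(custom)
```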

Upvotes: 1
