Save/load pandas dataframe with custom attributes

Question

I have a pandas.DataFrame to which I've attached some meta information, in the form of an attribute. I'd like to save/restore df with this intact, but it gets erased in the saving process:

import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

df.my_attribute = 'can I recover this attribute after saving?'
df.to_pickle('test.pkl')
new_df = pd.read_pickle('test.pkl')
new_df.my_attribute

# AttributeError: 'DataFrame' object has no attribute 'my_attribute'

Other file formats appear to be worse: csv and json discard type, index or column information if you're not careful. Maybe create a new class that extends DataFrame? Open to ideas.

chrisb · Accepted Answer

There is no universal standard here, or anything close to one, but there are a few options:

1) General advice: I wouldn't use pickle for anything but very short-term serialization (less than a day, say).
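
If a short-lived pickle really is enough, one possible variant (not part of the question's code) is to keep the metadata in `DataFrame.attrs` rather than in an ad-hoc instance attribute. `attrs` is an experimental dict available in pandas 1.0 and later. The sketch below assumes such a pandas version and uses a small stand-in frame instead of the iris data:

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame (assumption: any DataFrame behaves the same way here).
df = pd.DataFrame({"sepal length (cm)": [5.1, 4.9, 4.7]})

# Store the metadata in the experimental attrs dict (pandas >= 1.0)
# instead of setting df.my_attribute directly.
df.attrs["my_attribute"] = "can I recover this attribute after saving?"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.pkl")
    df.to_pickle(path)
    new_df = pd.read_pickle(path)

print(new_df.attrs["my_attribute"])
```

Since `attrs` is flagged as experimental, its behavior may change between releases, so this is best treated as a convenience for the same short-term use as pickle itself.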

2) Arbitrary metadata can be packed into two of the binary formats pandas supports, msgpack and HDF5, albeit in an ad-hoc way. You could also do this with CSV, etc., but it becomes even more ad-hoc.

# msgpack (note: to_msgpack/read_msgpack were deprecated in pandas 0.25 and removed in 1.0)
data = {'df': df, 'my_attribute': df.my_attribute}
pd.to_msgpack('tmp.msg', data)
pd.read_msgpack('tmp.msg')['my_attribute']
# Out[70]: 'can I recover this attribute after saving?'

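# hdf
# write: store the frame, then attach the metadata to its storer node
with pd.HDFStore('tmp.h5') as store:
    store.put('df', df)
    store.get_storer('df').attrs.my_attribute = df.my_attribute

# read: load the frame, then copy the metadata back onto it
with pd.HDFStore('tmp.h5') as store:
    df = store.get('df')
    df.my_attribute = store.get_storer('df').attrs.my_attribute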

df.my_attribute
# Out[79]: 'can I recover this attribute after saving?'
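
The CSV variant mentioned above could look something like the following: the frame goes into a CSV file and the metadata into a JSON "sidecar" file next to it. The helper names `save_with_meta` / `load_with_meta` and the `.meta.json` suffix are made up for illustration and are not a pandas convention. The usual CSV caveats about dtypes and index handling still apply.

```python
import json
import os
import tempfile

import pandas as pd


def save_with_meta(df, meta, csv_path):
    """Write df to CSV and meta (a JSON-serializable dict) to a sidecar file."""
    df.to_csv(csv_path, index=True)
    with open(csv_path + ".meta.json", "w") as fh:
        json.dump(meta, fh)


def load_with_meta(csv_path):
    """Read back the frame and the metadata written by save_with_meta."""
    df = pd.read_csv(csv_path, index_col=0)
    with open(csv_path + ".meta.json") as fh:
        meta = json.load(fh)
    return df, meta


# Small stand-in frame instead of the full iris data.
df = pd.DataFrame({"sepal length (cm)": [5.1, 4.9],
                   "sepal width (cm)": [3.5, 3.0]})
meta = {"my_attribute": "can I recover this attribute after saving?"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    save_with_meta(df, meta, path)
    new_df, new_meta = load_with_meta(path)

# Re-attach the metadata as a plain attribute, as in the question.
new_df.my_attribute = new_meta["my_attribute"]
print(new_df.my_attribute)
```

Keeping the metadata in a separate file leaves the CSV readable by any other tool, at the cost of having two files to keep together.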

3) xarray, which is an n-dimensional extension of pandas, supports storing to the NetCDF file format, which has a more built-in notion of metadata.

import xarray
ds = xarray.Dataset.from_dataframe(df)
ds.attrs['my_attribute'] = df.my_attribute

ds.to_netcdf('test.cdf')
ds = xarray.open_dataset('test.cdf')
ds
Out[8]: 
<xarray.Dataset>
Dimensions:            (index: 150)
Coordinates:
  * index              (index) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
Data variables:
    sepal length (cm)  (index) float64 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 ...
    sepal width (cm)   (index) float64 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 ...
    petal length (cm)  (index) float64 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 ...
    petal width (cm)   (index) float64 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 ...
Attributes:
    my_attribute:  can I recover this attribute after saving?
