anon01

Reputation: 11171

Save/load pandas dataframe with custom attributes

I have a pandas.DataFrame to which I've attached some meta information in the form of an attribute. I'd like to save and restore df with this attribute intact, but it gets erased in the saving process:

import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

df.my_attribute = 'can I recover this attribute after saving?'
df.to_pickle('test.pkl')
new_df = pd.read_pickle('test.pkl')
new_df.my_attribute

# AttributeError: 'DataFrame' object has no attribute 'my_attribute'

Other file formats appear to be worse: csv and json discard type, index or column information if you're not careful. Maybe create a new class that extends DataFrame? Open to ideas.

Upvotes: 5

Views: 2531

Answers (2)

anon01

Reputation: 11171

I wanted to store a small amount of metadata with my dataframe when I asked this question. Monkey-patching the information onto the DataFrame is maybe the worst option :). If I faced this issue today, I would probably do one of the following:

  1. use plaintext/markdown (readme, easiest and preferable)
  2. use json if I want a little bit of structure (easy, minimal change to flow)
  3. reach for "production grade" tooling (e.g. sqlite/hdf5/parquet) if this was going to be more serious

json is particularly good because it is both human- and machine-readable/editable. One option would be to store a json metadata file:

metadata.json:

[
    {
        "path": "df.pkl",
        "metadata": "some editable metadata string"
    },
    {
        "path": "some/path/to/df2.pkl",
        "metadata": "metadata for df2"
    }
]

You can even parse this into a df:

df_meta = pd.read_json("metadata.json")
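To look up the metadata for one particular file rather than loading the whole table, a small helper is enough. This is only a sketch using the standard library; `write_metadata` and `lookup_metadata` are hypothetical names made up for illustration, and the record layout is assumed to match the metadata.json shown above.

```python
import json
import os
import tempfile


def write_metadata(meta_path, entries):
    """Write a list of {"path": ..., "metadata": ...} records as JSON."""
    with open(meta_path, "w", encoding="utf-8") as fh:
        json.dump(entries, fh, indent=4)


def lookup_metadata(meta_path, data_path):
    """Return the metadata string recorded for data_path, or None if absent."""
    with open(meta_path, encoding="utf-8") as fh:
        entries = json.load(fh)
    for entry in entries:
        if entry["path"] == data_path:
            return entry["metadata"]
    return None


# Same records as in metadata.json above.
entries = [
    {"path": "df.pkl", "metadata": "some editable metadata string"},
    {"path": "some/path/to/df2.pkl", "metadata": "metadata for df2"},
]

# A temporary directory keeps the example from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = os.path.join(tmp, "metadata.json")
    write_metadata(meta_path, entries)
    found = lookup_metadata(meta_path, "df.pkl")
    missing = lookup_metadata(meta_path, "not_tracked.pkl")

print(found)
print(missing)
```

Writing the file with `json.dump` instead of by hand also avoids easy mistakes such as a trailing comma, which would make the file invalid json.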

Upvotes: 0

chrisb

Reputation: 52276

There is no universal standard here, or anything close to one, but there are a few options:

1) General advice - I wouldn't use pickle for anything but the shortest-term serialization (like <1 day)

2) Arbitrary metadata can be packed into two of the binary formats pandas supports, msgpack and HDF5, albeit in an ad-hoc way. You could also do this with CSV, etc., but it becomes even more ad-hoc.

# msgpack (deprecated in pandas 0.25 and removed in 1.0, so this needs an older pandas)
data = {'df': df, 'my_attribute': df.my_attribute}
pd.to_msgpack('tmp.msg', data)
pd.read_msgpack('tmp.msg')['my_attribute']
# Out[70]: 'can I recover this attribute after saving?'

# hdf
with pd.HDFStore('tmp.h5') as store:
    store.put('df', df)
    store.get_storer('df').attrs.my_attribute = df.my_attribute    
with pd.HDFStore('tmp.h5') as store:
    df = store.get('df')
    df.my_attribute = store.get_storer('df').attrs.my_attribute

df.my_attribute
Out[79]: 'can I recover this attribute after saving?'

3) xarray, an n-d extension of pandas, supports storing to the NetCDF file format, which has a more built-in notion of metadata

import xarray
ds = xarray.Dataset.from_dataframe(df)
ds.attrs['my_attribute'] = df.my_attribute

ds.to_netcdf('test.cdf')
ds = xarray.open_dataset('test.cdf')
ds
Out[8]: 
<xarray.Dataset>
Dimensions:            (index: 150)
Coordinates:
  * index              (index) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
Data variables:
    sepal length (cm)  (index) float64 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 ...
    sepal width (cm)   (index) float64 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 ...
    petal length (cm)  (index) float64 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 ...
    petal width (cm)   (index) float64 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 ...
Attributes:
    my_attribute:  can I recover this attribute after saving?
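The bundling idea from option 2 (one container holding both the frame and its attribute) does not depend on a particular pandas serializer. Below is a minimal sketch of the same pattern using the standard-library `pickle` module, so the caveat from point 1 about pickle as a long-term format still applies. The two-row frame and the file name `bundle.pkl` are stand-ins chosen for illustration, not part of the original example.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small stand-in for the iris frame used in the question.
df = pd.DataFrame(
    {"sepal length (cm)": [5.1, 4.9], "sepal width (cm)": [3.5, 3.0]}
)

# Keep the metadata next to the frame instead of on the frame.
bundle = {"df": df, "my_attribute": "can I recover this attribute after saving?"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bundle.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(bundle, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

new_df = restored["df"]
my_attribute = restored["my_attribute"]
print(my_attribute)
```

Because the attribute lives in the dict rather than on the DataFrame object, nothing is lost when the frame is rebuilt on load.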

Upvotes: 4
