Reputation: 6263

Save additional attributes in Pandas Dataframe

I recall from my MatLab days using structured arrays wherein you could store different data as an attribute of the main structure. Something like:

a = {}
a.A = magic(10)
a.B = magic(50); etc.

where a.A and a.B are completely separate from each other allowing you to store different types within a and operate on them as desired. Pandas allows us to do something similar, but not quite the same.

I am using Pandas and want to store attributes of a dataframe without actually putting it within a dataframe. This can be done via:

import pandas as pd

a = pd.DataFrame(data=pd.np.random.randint(0,100,(10,5)),columns=list('ABCED')

# now store an attribute of <a>
a.local_tz = 'US/Eastern'

Now, the local timezone is stored in a, but I cannot save this attribute when I save the dataframe (i.e. after re-loading a there is no a.local_tz). Is there a way to save these attributes?

Currently, I am just making new columns in the dataframe to hold information like timezone, latitude, longituded, etc., but this seems to be a bit of a waste. Further, when I do analysis on the data I run into problems of having to exclude these other columns.

################## BEGIN EDIT ##################

Using unutbu's advice, I now store the data in h5 format. As mentioned, loading metadata back in as attributes of the dataframe is risky. However, since I am the creator of these files (and the processing algorithms) I can choose what is stored as metadata and what is not. When processing the data that will go into the h5 files, I choose to store the metadata in a dictionary that is initialized as an attribute of my classes. I made a simple IO class to import the h5 data, and made the metadata as class attributes. Now I can work on my dataframes without risk of losing the metadata.

class IO():
    def __init__(self):
        self.dtfrmt = 'dummy_str'

    def h5load(self,filename,update=False):
        '''h5load loads the stored HDF5 file.  Both the dataframe (actual data) and 
        the associated metadata are stored in the H5file

        NOTE: This does not load "any" H5 
        file, it loads H5 files specifically created to hold dataframe data and 
        metadata.

        When multi-indexed dataframes are stored in the H5 format the date 
        values (previously initialized with timezone information) lose their
        timezone localization.  Therefore, <h5load> re-localizes the 'DATE' 
        index as UTC.

        Parameters
        ----------
        filename : string/path
            path and filename of H5 file to be loaded.  H5 file must have been 
            created using <h5store> below.

        udatedf : boolean True/False
            default: False
            If the selected dataframe is to be updated then it is imported 
            slightly different.  If update==True, the <metadata> attribute is
            returned as a dictionary and <data> is returned as a dataframe 
            (i.e., as a stand-alone dictionary with no attributes, and NOT an 
            instance of the IO() class).  Otherwise, if False, <metadata> is 
            returned as an attribute of the class instance.

        Output
        ------
        data : Pandas dataframe with attributes
            The dataframe contains only the data as collected by the instrument.  
            Any metadata (e.g. timezone, scaling factor, basically anything that
            is constant throughout the file) is stored as an attribute (e.g. lat 
            is stored as <data.lat>).'''

        with pd.HDFStore(filename,'r') as store:
            self.data = store['mydata']
            self.metadata = store.get_storer('mydata').attrs.metadata    # metadata gets stored as attributes, so no need to make <metadata> an attribute of <self>

            # put metadata into <data> dataframe as attributes
            for r in self.metadata:
                setattr(self,r,self.metadata[r])

        # unscale data
        self.data, self.metadata = unscale(self.data,self.metadata,stringcols=['routine','date'])

        # when pandas stores multi-index dataframes as H5 files the timezone
        # initialization is lost.  Remake index with timezone initialized: only
        # for multi-indexed dataframes
        if isinstance(self.data.index,pd.core.index.MultiIndex):
            # list index-level names, and identify 'DATE' level
            namen = self.data.index.names
            date_lev = namen.index('DATE')

            # extract index as list and remake tuples with timezone initialized
            new_index = pd.MultiIndex.tolist(self.data.index)
            for r in xrange( len(new_index) ):
                tmp = list( new_index[r] )
                tmp[date_lev] = utc.localize( tmp[date_lev] )

                new_index[r] = tuple(tmp)

            # reset multi-index
            self.data.index = pd.MultiIndex.from_tuples( new_index, names=namen )


        if update:
            return self.metadata, self.data
        else:
            return self





    def h5store(self,data, filename, **kwargs):
        '''h5store stores the dataframe as an HDF5 file.  Both the dataframe 
        (actual data) and the associated metadata are stored in the H5file

        Parameters
        ----------
        data : Pandas dataframe NOT a class instance
            Must be a dataframe, not a class instance (i.e. cannot be an instance 
            named <data> that has an attribute named <data> (e.g. the Pandas 
            data frame is stored in data.data)).  If the dataframe is under
            data.data then the input variable must be data.data.

        filename : string/path
            path and filename of H5 file to be loaded.  H5 file must have been 
            created using <h5store> below.

        **kwargs : dictionary
            dictionary containing metadata information.


        Output
        ------
        None: only saves data to file'''

        with pd.HDFStore(filename,'w') as store:
            store.put('mydata',data)
            store.get_storer('mydata').attrs.metadata = kwargs

H5 files are then loaded via data = IO().h5load('filename.h5') the dataframe is stored under data.data I retain the metadata dictionary under data.metadata and have created individual metadata attributes (e.g. data.lat created from data.metadata['lat']).

My index time stamps are localized to pytz.utc(). However, when a multi-indexed dataframe is stored to h5 the timezone localization is lost (using Pandas 15.2), so I correct for this in IO().h5load.

Upvotes: 25

Answers (3)

JohnE

Reputation: 30434

Pandas now has an attribute, "attrs", for storing metadata. As noted in the documentation, it is experimental and subject to being changed in the future. There is not much else in the documentation, and while it's quite limited, it's also very easy to access and use:

df = pd.DataFrame({'fruit':['apple','pear','banana'], 'price':[3,4,2]})

df.attrs = { 'description':'fruit prices in dollars per pound' }

When you create "df", "df.attrs" is initialized as an empty dictionary. Beyond that, it appears you can store whatever you want in "df.attrs", as long as the top level is a dictionary. Above, I just stored a description of the dataframe but another option is to store column labels (e.g. see my answer here: How to handle meta data associated with a pandas dataframe? )

Persistence appears limited to a small number of dataframe operations. For example, the values in "attrs" are retained with copy, loc, and iloc but not with groupby. One can of course always reattach via a simple assignment. For example, if you create a dataframe called "grouped_means" via a groupby operation, it will not retain "attrs", but you can easily reattach with:

grouped_means.attrs = df.attrs

And perhaps that could be made automatic in future versions of pandas. So it is a pretty limited feature, but is very easy to use and might be improved in the future.

Upvotes: 6

unutbu

Reputation: 880269

There is an open issue regarding the storage of custom metadata in NDFrames. But due to the multitudinous ways pandas functions may return DataFrames, the _metadata attribute is not (yet) preserved in all situations.

For the time being, you'll just have to store the metadata in an auxilliary variable.

There are multiple options for storing DataFrames + metadata to files, depending on what format you wish to use -- pickle, JSON, HDF5 are all possibilities.

Here is how you could store and load a DataFrame with metadata using HDF5. The recipe for storing the metadata comes from the Pandas Cookbook.

import numpy as np
import pandas as pd

def h5store(filename, df, **kwargs):
    store = pd.HDFStore(filename)
    store.put('mydata', df)
    store.get_storer('mydata').attrs.metadata = kwargs
    store.close()

def h5load(store):
    data = store['mydata']
    metadata = store.get_storer('mydata').attrs.metadata
    return data, metadata

a = pd.DataFrame(
    data=pd.np.random.randint(0, 100, (10, 5)), columns=list('ABCED'))

filename = '/tmp/data.h5'
metadata = dict(local_tz='US/Eastern')
h5store(filename, a, **metadata)
with pd.HDFStore(filename) as store:
    data, metadata = h5load(store)

print(data)
#     A   B   C   E   D
# 0   9  20  92  43  25
# 1   2  64  54   0  63
# 2  22  42   3  83  81
# 3   3  71  17  64  53
# 4  52  10  41  22  43
# 5  48  85  96  72  88
# 6  10  47   2  10  78
# 7  30  80   3  59  16
# 8  13  52  98  79  65
# 9   6  93  55  40   3

print(metadata)

yields

{'local_tz': 'US/Eastern'}

Upvotes: 38

TheBlackCat

Reputation: 10318

The approach I use is to add additional MultiIndex levels to store the additional information I want (I use the columns, but either would work). All the columns have the same values for these additional parameters. This is also useful since I can combine multiple dataframes or split out individual columns and these values are preserved.

>>> col=pd.MultiIndex.from_product([['US/Eastern'], ['A', 'B', 'C', 'E', 'D']], names=['local_tz', 'name'])
>>> a = pd.DataFrame(data=pd.np.random.randint(0,100,(10,5)),columns=col)
>>> print(a)
local_tz US/Eastern                
name              A   B   C   E   D
0                38  93  63  24  55
1                21  25  84  98  62
2                 4  60  78   0   5
3                26  50  82  89  23
4                32  70  80  90   1
5                 6  17   8  60  59
6                95  98  69  19  76
7                71  90  45  45  40
8                94  16  44  60  16
9                53   8  30   4  72

Upvotes: 7

Save additional attributes in Pandas Dataframe

Answers (3)

Related Questions