Reputation: 7265

Adding meta-information/metadata to pandas DataFrame

Is it possible to add some meta-information/metadata to a pandas DataFrame?

For example, the name of the instrument used to measure the data, the instrument responsible, etc.

One workaround would be to create a column with that information, but it seems wasteful to store a single piece of information in every row!

Upvotes: 163

Answers (13)

unutbu

Reputation: 880259

Sure, like most Python objects, you can attach new attributes to a pandas.DataFrame:

import pandas as pd
df = pd.DataFrame([])
df.instrument_name = 'Binky'

Note, however, that while you can attach attributes to a DataFrame, operations performed on the DataFrame (such as groupby, pivot, join, assign or loc to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
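
A minimal sketch of the caveat above, assuming a small made-up frame with a single column x. The attribute stays on the original object, but copy() and boolean filtering return new DataFrames that are not expected to carry it:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
df.instrument_name = "Binky"  # plain Python attribute, unknown to pandas

# The original object still carries the attribute.
print(df.instrument_name)  # Binky

# Operations return new DataFrame objects without the attribute.
copied = df.copy()
filtered = df[df["x"] > 1]

print(hasattr(copied, "instrument_name"))    # False
print(hasattr(filtered, "instrument_name"))  # False
```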

Preserving the metadata in a file is possible. You can find an example of how to store metadata in an HDF5 file here.

Upvotes: 113

Olshansky

Reputation: 6414

For those looking to store the DataFrame in an HDFStore, according to pandas.pydata.org, the recommended approach is:

import pandas as pd

df = pd.DataFrame(dict(keys=['a', 'b', 'c'], values=['1', '2', '3']))
df.to_hdf('/tmp/temp_df.h5', key='temp_df')
# Reopen the file and attach the metadata to the stored table's attributes.
store = pd.HDFStore('/tmp/temp_df.h5')
store.get_storer('temp_df').attrs.attr_key = 'attr_value'
store.close()

Upvotes: 1

keepAlive

Reputation: 6665

Referring to the section "Define original properties" of the official pandas documentation, and if subclassing pandas.DataFrame is an option, note that:

To let original data structures have additional properties, you should let pandas know what properties are added.

Thus, something you can do - where the name MetaedDataFrame is arbitrarily chosen - is

import pandas as pd


class MetaedDataFrame(pd.DataFrame):
    """s/e."""
    _metadata = ['instrument_name']

    @property
    def _constructor(self):
        return self.__class__

    # Define the following if providing attribute(s) at instantiation
    # is a requirement, otherwise, if YAGNI, don't.
    def __init__(
        self, *args, instrument_name: str = None, **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.instrument_name = instrument_name

And then instantiate your dataframe with your (_metadata-prespecified) attribute(s)

>>> mdf = MetaedDataFrame(instrument_name='Binky')
>>> mdf.instrument_name
'Binky'

Or even after instantiation

>>> mdf = MetaedDataFrame()
>>> mdf.instrument_name = 'Binky'
>>> mdf.instrument_name
'Binky'

As of 2021/06/15, serialization and .copy() work like a charm, without any kind of warning. This approach also lets you enrich your API, e.g. by adding some instrument_name-based members to MetaedDataFrame, such as properties (or methods):

    [...]
    
    @property
    def lower_instrument_name(self) -> str:
        if self.instrument_name is not None:
            return self.instrument_name.lower()

    [...]

>>> mdf.lower_instrument_name
'binky'

... but this is rather beyond the scope of this question ...
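
To check the claim above that .copy() keeps the attribute, here is a self-contained sketch that repeats the subclass and then derives two new frames from it. The column x and the use of head() are my own illustrative choices, not part of the original answer:

```python
import pandas as pd


class MetaedDataFrame(pd.DataFrame):
    # Names listed here are the ones pandas copies onto derived objects.
    _metadata = ["instrument_name"]

    @property
    def _constructor(self):
        return self.__class__

    def __init__(self, *args, instrument_name=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.instrument_name = instrument_name


mdf = MetaedDataFrame({"x": [1, 2, 3]}, instrument_name="Binky")

copied = mdf.copy()
first_rows = mdf.head(2)

# Both results should still be MetaedDataFrame instances carrying the name.
print(type(copied).__name__, copied.instrument_name)
print(type(first_rows).__name__, first_rows.instrument_name)
```

Not every pandas operation is guaranteed to propagate _metadata, so it is worth running a check like this for the specific methods you rely on.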

Upvotes: 5

Jon

Reputation: 1142

Adding raw attributes with pandas (e.g. df.my_metadata = "source.csv") is not a good idea.

Even on pandas 1.2.4 with Python 3.8 (the latest version when this was written), doing this will randomly cause segfaults during very simple operations involving things like read_csv. It is hard to debug, because read_csv itself works fine, but later on (seemingly at random) you will find that the DataFrame has been freed from memory.

The CPython extensions behind pandas seem to make very explicit assumptions about the data layout of the DataFrame.

attrs is the only safe way to use metadata properties currently: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.attrs.html

e.g.

df.attrs.update({'my_metadata' : "source.csv"})

How attrs should behave in all scenarios is not fully fleshed out. You can help provide feedback on the expected behaviors of attrs in this issue: https://github.com/pandas-dev/pandas/issues/28283
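
A small sketch of attrs in use, assuming a made-up frame with one column x. It shows the dictionary being carried over by copy(), and that the copy holds its own dictionary rather than sharing the original's:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
df.attrs["my_metadata"] = "source.csv"

copied = df.copy()
print(copied.attrs)  # {'my_metadata': 'source.csv'}

# Changing the copy's attrs leaves the original untouched.
copied.attrs["my_metadata"] = "other.csv"
print(df.attrs["my_metadata"])  # source.csv
```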

Upvotes: 1

ryanjdillon

Reputation: 18978

As of pandas 1.0, possibly earlier, there is now a DataFrame.attrs property. It is experimental, but this is probably what you'll want in the future. For example:

import pandas as pd
df = pd.DataFrame([])
df.attrs['instrument_name'] = 'Binky'

Find it in the docs here.

Trying this out with to_parquet and then read_parquet, the attrs don't seem to persist, so be sure to check this against your use case.
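
If attrs do not survive your file format, one possible workaround is to write them to a sidecar file and re-attach them after loading. The sketch below is only an illustration under a few assumptions: it uses CSV instead of parquet so that no extra engine is needed, the file names are hypothetical, and the attrs values must be JSON-serializable:

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
df.attrs["instrument_name"] = "Binky"

with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "data.csv"        # hypothetical file names
    meta_path = Path(tmp) / "data.meta.json"

    # Write the table and its metadata side by side.
    df.to_csv(data_path, index=False)
    meta_path.write_text(json.dumps(df.attrs))

    # Read both back and re-attach the metadata.
    restored = pd.read_csv(data_path)
    restored.attrs.update(json.loads(meta_path.read_text()))

print(restored.attrs)  # {'instrument_name': 'Binky'}
```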

Upvotes: 92

DisplayName

Reputation: 249

I have been looking for a solution and found that a pandas DataFrame has the attrs property:

import pandas as pd
frame = pd.DataFrame()
frame.attrs.update({'your_attribute': 'value'})
frame.attrs['your_attribute']

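This attribute stays attached to your frame wherever you pass it, although attrs is still experimental and is not guaranteed to carry over to every new frame that an operation returns.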

Upvotes: 5

SenAnan

Reputation: 276

I was having the same issue and used a workaround of creating a new, smaller DF from a dictionary with the metadata:

meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
dfMeta = pd.DataFrame.from_dict(meta, orient='index')

This dfMeta can then be saved alongside your original DF, for example with pickle.

See "Saving and loading multiple objects in pickle file?" (Lutz's answer) for an excellent explanation of saving and retrieving multiple dataframes using pickle.
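
One possible way to keep the two frames together, sketched here with a made-up data frame and a hypothetical file name, is to pickle them as a single dictionary. This is not necessarily the approach in the linked answer:

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
meta = {"name": "Sample Dataframe", "Created": "19/07/2019"}
dfMeta = pd.DataFrame.from_dict(meta, orient="index")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bundle.pkl"  # hypothetical file name

    # Store both frames in one pickle file as a single dictionary.
    with open(path, "wb") as fh:
        pickle.dump({"data": df, "meta": dfMeta}, fh)

    with open(path, "rb") as fh:
        bundle = pickle.load(fh)

# from_dict(..., orient="index") puts the values in a column labelled 0.
print(bundle["meta"].loc["name", 0])  # Sample Dataframe
```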

Upvotes: 2

bscan

Reputation: 3046

The top answer's approach of attaching arbitrary attributes to the DataFrame object is good, but if the value is a dictionary, list, or tuple, pandas emits the warning "Pandas doesn't allow columns to be created via a new attribute name". The following solution avoids this and works for storing arbitrary attributes.

from types import SimpleNamespace

import pandas as pd

df = pd.DataFrame()
df.meta = SimpleNamespace()  # container for arbitrary metadata
df.meta.foo = [1, 2, 3]
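
The difference can be observed by recording warnings while setting attributes. In this sketch the attribute name tags is made up for illustration. Setting a list directly on the frame is expected to trigger a UserWarning with the message quoted above, while going through the SimpleNamespace should not:

```python
import warnings
from types import SimpleNamespace

import pandas as pd

df = pd.DataFrame()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df.tags = [1, 2, 3]          # list-like value set directly on the frame
    df.meta = SimpleNamespace()  # a namespace object is not list-like
    df.meta.foo = [1, 2, 3]      # set on the namespace, not on the frame

# Show whatever was recorded; only the df.tags line should appear here.
for w in caught:
    print(w.category.__name__, "-", w.message)

print(df.meta.foo)  # [1, 2, 3]
```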

Upvotes: 11

jtwilson

Reputation: 325

As mentioned by @choldgraf, I have found xarray to be an excellent tool for attaching metadata when comparing data and plotting results across several dataframes.

In my work, we often compare the results of several firmware revisions and different test scenarios. Adding this information is as simple as this:

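import pandas as pd
import xarray as xr

# meaningless_test, foo, bar and sc_01 are placeholders for your own path and values
df = pd.read_csv(meaningless_test)
metadata = {'fw': foo, 'test_name': bar, 'scenario': sc_01}
ds = xr.Dataset.from_dataframe(df)
ds.attrs = metadata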

Upvotes: 8

Dennis Golomazov

Reputation: 17349

As mentioned in other answers and comments, _metadata is not part of the public API, so it's definitely not a good idea to use it in a production environment. But you may still want to use it for research prototyping and replace it if it stops working. Right now it works with groupby/apply, which is helpful. Here is an example (which I couldn't find in other answers):

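import pandas as pd

df = pd.DataFrame([1, 2, 2, 3, 3], columns=['val'])
df.my_attribute = "my_value"
# Register the attribute name so pandas propagates it to derived frames.
# Note: _metadata is a class-level list, so this affects every DataFrame.
df._metadata.append('my_attribute')
df.groupby('val').apply(lambda group: group.my_attribute)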

Output:

val
1    my_value
2    my_value
3    my_value
dtype: object

Upvotes: 7

choldgraf

Reputation: 3689

Coming pretty late to this, I thought this might be helpful if you need metadata to persist over I/O. There's a relatively new package called h5io that I've been using to accomplish this.

It should let you do a quick read/write from HDF5 for a few common formats, one of them being a dataframe. So you can, for example, put a dataframe in a dictionary and include metadata as fields in the dictionary. E.g.:

import h5io

# my_df is your existing DataFrame
save_dict = dict(data=my_df, name='chris', record_date='1/1/2016')
h5io.write_hdf5('path/to/file.hdf5', save_dict)
in_data = h5io.read_hdf5('path/to/file.hdf5')
df = in_data['data']
name = in_data['name']
# etc.

Another option would be to look into a project like xarray (formerly xray), which is more complex in some ways, but I think it does let you use metadata and is pretty easy to convert to a DataFrame.

Upvotes: 4

follyroof

Reputation: 3540

Just ran into this issue myself. As of pandas 0.13, DataFrames have a _metadata attribute on them that does persist through functions that return new DataFrames. Also seems to survive serialization just fine (I've only tried json, but I imagine hdf is covered as well).

Upvotes: 15

Matti John

Reputation: 20507

Not really. Although you could add attributes containing metadata to the DataFrame class as @unutbu mentions, many DataFrame methods return a new DataFrame, so your metadata would be lost. If you need to manipulate your DataFrame, the best option would be to wrap your metadata and DataFrame in another class. See this discussion on GitHub: https://github.com/pydata/pandas/issues/2485

There is currently an open pull request to add a MetaDataFrame object, which would support metadata better.
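
The wrapper-class idea can be sketched roughly as follows. The class name AnnotatedFrame, its pipe method and the sample data are all hypothetical choices for illustration, not an existing pandas API:

```python
from dataclasses import dataclass, field
from typing import Callable

import pandas as pd


@dataclass(eq=False)
class AnnotatedFrame:
    """Hypothetical wrapper pairing a DataFrame with its metadata."""

    data: pd.DataFrame
    meta: dict = field(default_factory=dict)

    def pipe(self, func: Callable[[pd.DataFrame], pd.DataFrame]) -> "AnnotatedFrame":
        # Apply func to the wrapped frame and carry a copy of the metadata along.
        return AnnotatedFrame(func(self.data), dict(self.meta))


af = AnnotatedFrame(pd.DataFrame({"x": [1, 2, 3]}), {"instrument_name": "Binky"})
filtered = af.pipe(lambda d: d[d["x"] > 1])

print(filtered.meta)       # {'instrument_name': 'Binky'}
print(len(filtered.data))  # 2
```

Because every transformation goes through the wrapper, the metadata no longer depends on whether a given pandas method propagates it.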

Upvotes: 14
