hendriksc1

Reputation: 59

Attach data source info to pandas series

Is there a way to attach information on the data source to a pandas series? At the moment I just add columns to the dataframe that indicate the source for each variable...

Thanks a lot for ideas and suggestions!

Upvotes: 3

Views: 376

Answers (2)

Xukrao

Reputation: 8634

From the official pandas documentation:

To let original data structures have additional properties, you should let pandas know what properties are added. pandas maps unknown properties to data names overriding __getattribute__. Defining original properties can be done in one of 2 ways:

  1. Define _internal_names and _internal_names_set for temporary properties which WILL NOT be passed to manipulation results.

  2. Define _metadata for normal properties which will be passed to manipulation results.

Below is an example to define two original properties, “internal_cache” as a temporary property and “added_property” as a normal property

import pandas as pd

class SubclassedDataFrame2(pd.DataFrame):

    # temporary properties
    _internal_names = pd.DataFrame._internal_names + ['internal_cache']
    _internal_names_set = set(_internal_names)

    # normal properties
    _metadata = ['added_property']

    # make manipulation results come back as the subclass
    @property
    def _constructor(self):
        return SubclassedDataFrame2

>>> df = SubclassedDataFrame2({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
>>> df
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

>>> df.internal_cache = 'cached'
>>> df.added_property = 'property'

>>> df.internal_cache
cached
>>> df.added_property
property

# properties defined in _internal_names are reset after manipulation
>>> df[['A', 'B']].internal_cache
AttributeError: 'SubclassedDataFrame2' object has no attribute 'internal_cache'

# properties defined in _metadata are retained
>>> df[['A', 'B']].added_property
property

As you can see, the benefit of defining custom properties through _metadata is that they are propagated automatically during (most) one-to-one dataframe operations. Be aware, though, that during many-to-one dataframe operations (e.g. merge() or concat()) your custom properties will still be lost.
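Since the question is about a Series rather than a DataFrame, here is a minimal sketch of the same idea applied to pd.Series. The class name SourcedSeries, the attribute name source and the value "World Bank" are made up for illustration; they are not part of pandas. I extend the base _metadata list instead of replacing it, on the assumption that the Series name should keep propagating as usual.

```python
import pandas as pd


class SourcedSeries(pd.Series):
    """Series subclass carrying a hypothetical `source` attribute."""

    # Extend (not replace) the base list so the Series name still propagates.
    _metadata = pd.Series._metadata + ["source"]

    @property
    def _constructor(self):
        # Manipulation results are rebuilt through this constructor.
        return SourcedSeries


gdp = SourcedSeries([1.0, 2.0, 3.0], index=list("abc"), name="gdp")
gdp.source = "World Bank"  # illustrative value, not real data

head = gdp.head(2)  # one-to-one operation
doubled = gdp * 2   # element-wise arithmetic

print(head.source, doubled.source)

# Many-to-one operation: per the caveat above, do not rely on `source` here.
combined = pd.concat([gdp, gdp])
print(getattr(combined, "source", None))
```

The first print should show the source on both derived series. The getattr with a default is there so the last line does not raise if concat() has dropped the property, which is the behaviour described above.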

Upvotes: 3

jpp

Reputation: 164773

As with most Python objects, you can add an attribute to a series using period (.) syntax. However, you should be careful that your attribute names do not conflict with index labels. Here's a demonstration:

import pandas as pd

s = pd.Series(list(range(3)), index=list('abc'))
s.a = 10
s.d = 20

print(s.a, s.d)

10 20

print(s)

a    10
b     1
c     2

As shown above, you may unwittingly overwrite the value for a label when in fact you want to add an attribute. One way to alleviate this problem, as described here, is to perform a simple check:

if 'a' not in s:
    s.a = 100
else:
    print('Attempt to overwrite label when setting attribute aborted!')
    # or raise a custom error
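If you set attributes in more than one place, the check can be wrapped in a small helper. This is only a sketch: set_attribute_safely is a hypothetical function of my own, not a pandas API, and the attribute name source and its value are illustrative.

```python
import pandas as pd


def set_attribute_safely(series, name, value):
    """Attach `value` as attribute `name`, refusing to clobber an index label."""
    if name in series.index:
        raise AttributeError(
            f"{name!r} is an index label; refusing to overwrite its value"
        )
    setattr(series, name, value)


s = pd.Series(range(3), index=list("abc"))

# 'source' is not a label, so it becomes a plain attribute.
set_attribute_safely(s, "source", "survey_2018")
print(s.source)

# 'a' is a label, so the helper raises instead of changing the data.
try:
    set_attribute_safely(s, "a", 100)
except AttributeError as exc:
    print(exc)

print(s["a"])  # the value stored under label 'a' is untouched
```

Raising an error rather than printing a message means a clash cannot go unnoticed in a longer script.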

Note that operations on a dataframe such as GroupBy, pivot, etc., as described here, may return copies of data with attributes removed.
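A quick way to see this for yourself is the sketch below. The column names and the source value are made up for illustration, and the attribute is set with plain dot syntax, i.e. it is not registered in _metadata.

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "value": [1, 2, 3]})
df.source = "survey_2018"  # plain attribute, illustrative value

grouped = df.groupby("key").sum()  # returns a new object
copied = df.copy()                 # so does an explicit copy

print(hasattr(df, "source"))       # the original still has it
print(hasattr(grouped, "source"))  # the groupby result does not
print(hasattr(copied, "source"))   # neither does the copy
```

In other words, an attribute set this way lives only on the one object you attached it to, so you would need to re-attach it after each such operation.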

Finally, for storing dataframes or series with metadata attached, you may wish to consider HDF5. See, for example, this answer.

Upvotes: 2
