Reputation: 59
Is there a way to attach information on the data source to a pandas series? At the moment I just add columns to the dataframe that indicate the source for each variable...
Thanks a lot for ideas and suggestions!
Upvotes: 3
Views: 376
Reputation: 8634
From the official pandas documentation:
To let original data structures have additional properties, you should let pandas know what properties are added. pandas maps unknown properties to data names overriding __getattribute__. Defining original properties can be done in one of 2 ways:
1. Define _internal_names and _internal_names_set for temporary properties which WILL NOT be passed to manipulation results.
2. Define _metadata for normal properties which will be passed to manipulation results.
Below is an example to define two original properties, "internal_cache" as a temporary property and "added_property" as a normal property:
import pandas as pd

class SubclassedDataFrame2(pd.DataFrame):

    # temporary properties
    _internal_names = pd.DataFrame._internal_names + ['internal_cache']
    _internal_names_set = set(_internal_names)

    # normal properties
    _metadata = ['added_property']

    @property
    def _constructor(self):
        return SubclassedDataFrame2
>>> df = SubclassedDataFrame2({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
>>> df
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
>>> df.internal_cache = 'cached'
>>> df.added_property = 'property'
>>> df.internal_cache
cached
>>> df.added_property
property
# properties defined in _internal_names is reset after manipulation
>>> df[['A', 'B']].internal_cache
AttributeError: 'SubclassedDataFrame2' object has no attribute 'internal_cache'
# properties defined in _metadata are retained
>>> df[['A', 'B']].added_property
property
As you can see, the benefit of defining custom properties through _metadata is that the properties will be propagated automatically during (most) one-to-one dataframe operations. Be aware, though, that during many-to-one dataframe operations (e.g. merge() or concat()) your custom properties will still be lost.
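Applied to the original question, a minimal sketch might look like the following. The class name SourcedFrame, the property name source, and the file name are hypothetical and chosen only for illustration; the result of the concat() call can differ between pandas versions, so it is printed without an expected value.

```python
import pandas as pd


class SourcedFrame(pd.DataFrame):
    # "source" is a hypothetical property name; listing it in _metadata
    # asks pandas to carry it over to results of one-to-one operations.
    _metadata = ["source"]

    @property
    def _constructor(self):
        return SourcedFrame


df = SourcedFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.source = "survey_2020.csv"  # hypothetical data source

# One-to-one operation: the property is expected to be retained.
subset = df[["A"]]
print(subset.source)

# Many-to-one operation: do not rely on the property surviving.
combined = pd.concat([df, df])
print(getattr(combined, "source", None))
```

If the source has to survive a merge() or concat(), one option is to set it again explicitly on the result.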
Upvotes: 3
Reputation: 164773
As with most Python objects, you can add an attribute to a series using period (.) syntax. However, you should be careful that your attribute names do not conflict with labels. Here's a demonstration:
import pandas as pd
s = pd.Series(list(range(3)), index=list('abc'))
s.a = 10
s.d = 20
print(s.a, s.d)
10 20
print(s)
a    10
b     1
c     2
dtype: int64
As shown above, you may unwittingly overwrite the value for a label when in fact you want to add an attribute named a. One way to alleviate this problem, as described here, is to perform a simple check:
if 'a' not in s:
    s.a = 100
else:
    print('Attempt to overwrite label when setting attribute aborted!')
    # or raise a custom error
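The check above could also be wrapped in a small helper that raises instead of printing. This is only a sketch: set_attribute_safely is a hypothetical name, not a pandas function, and the extra guard against shadowing existing Series methods is an addition not mentioned in the answer.

```python
import pandas as pd


def set_attribute_safely(series, name, value):
    """Attach `value` as attribute `name`, refusing to touch labels or methods.

    Hypothetical helper for illustration; not part of pandas.
    """
    if name in series.index:
        raise AttributeError(
            f"{name!r} is an index label; setting it would overwrite data"
        )
    if hasattr(type(series), name):
        raise AttributeError(
            f"{name!r} would shadow an existing Series attribute"
        )
    setattr(series, name, value)


s = pd.Series(list(range(3)), index=list("abc"))
set_attribute_safely(s, "source", "sensor_log.csv")  # hypothetical source name

try:
    set_attribute_safely(s, "a", 100)
except AttributeError as exc:
    print(exc)
```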
Note that operations on a dataframe such as GroupBy, pivot, etc., as described here, may return copies of data with attributes removed.
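The same loss can be seen on a series with a simple arithmetic operation. The sketch below uses a hypothetical attribute name, source, and reattaches it by hand afterwards, which is one possible workaround and not something pandas does for you.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=list("abc"))
s.source = "sensor_log.csv"  # hypothetical attribute and value

# The operation returns a new Series that does not carry the attribute.
doubled = s * 2
print(hasattr(s, "source"))
print(hasattr(doubled, "source"))

# Manual workaround: copy the attribute onto the result yourself.
doubled.source = s.source
print(doubled.source)
```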
Finally, for storing dataframes or series with metadata attached, you may wish to consider HDF5. See, for example, this answer.
Upvotes: 2