shortorian

Reputation: 1182

How do I subclass or otherwise extend a pandas DataFrame without breaking DataFrame.append()?

I have a complex object I'd like to build around a pandas DataFrame. I've tried to do this with a subclass, but appending to the DataFrame reinitializes all properties in a new instance, even when using _metadata, as recommended here. I know subclassing pandas objects is not recommended, but I don't know how to do what I want with composition (or any other method), so if someone can tell me how to do this without subclassing, that would be great.

I'm working with the following code:

import pandas as pd

class thisDF(pd.DataFrame):

    @property
    def _constructor(self):
        return thisDF

    _metadata = ['new_property']

    def __init__(self, data=None, index=None, columns=None, copy=False, new_property='reset'):
        
        super(thisDF, self).__init__(data=data, index=index, columns=columns, dtype='str', copy=copy)

        self.new_property = new_property

cols = ['A', 'B', 'C']
new_property = cols[:2]
tdf = thisDF(columns=cols, new_property=new_property)

As in the examples I linked to above, operations like tdf[['A', 'B']].new_property work fine. However, modifying the data in a way that creates a new copy initializes a new instance that doesn't retain new_property. So the code

print(tdf.new_property)
tdf = tdf.append(pd.Series(['a', 'b', 'c'], index=tdf.columns), ignore_index=True)
print(tdf.new_property)

outputs

['A', 'B']
reset

How do I extend pd.DataFrame so that thisDF.append() retains instance attributes (or some equivalent data structure if not using a subclass)? Note that I can do everything I want by making a class with a DataFrame as an attribute, but I don't want to do my_object.dataframe.some_method() for all DataFrame operations.

Upvotes: 3

Views: 1865

Answers (1)

sunnytown

Reputation: 1996

"[...] or wrapping all DataFrame methods with my_object class methods (because I'm assuming that would be a lot of work, correct?)"

No, it doesn't have to be a lot of work. You don't have to wrap every method of the wrapped object yourself. You can use __getattr__ to pass attribute access down to the wrapped object like this:

class WrappedDataFrame:
    def __init__(self, df, new_property):
        self._df = df
        self.new_property = new_property
    
    def __getattr__(self, attr):
        # Only called when normal lookup on the wrapper has already failed.
        # Guard _df so that a missing _df (e.g. during copying or unpickling)
        # raises AttributeError instead of recursing forever.
        if attr == '_df':
            raise AttributeError(attr)
        return getattr(self._df, attr)
    
    def __getitem__(self, item):
        return self._df[item]
    
    def __setitem__(self, item, data):
        self._df[item] = data

__getattr__ is a dunder method that Python calls only when normal attribute lookup on the instance fails. Attributes and methods the wrapper defines itself (such as new_property) are therefore found without it. Anything else is looked up on the wrapped DataFrame with getattr and returned from there. Implicit special-method lookups such as indexing bypass __getattr__, which is why __getitem__ and __setitem__ are defined explicitly.

So this class behaves like a DataFrame for most purposes. You can now implement just the methods you want to behave differently, like append in your example.

You could either make it so that append modifies the wrapped DataFrame object

    def append(self, *args, **kwargs):
        self._df = self._df.append(*args, **kwargs)

or so that it returns a new instance of the WrappedDataFrame class, which of course keeps all your functionality.

    def append(self, *args, **kwargs):
        # Pass new_property along; the constructor requires it.
        return self.__class__(self._df.append(*args, **kwargs), self.new_property)
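Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas versions the wrapper has to build on pd.concat instead. Below is a rough sketch of the second variant under that assumption. The method name append_row and the sample data are my own choices for illustration, not part of pandas or of the question.

```python
import pandas as pd


class WrappedDataFrame:
    def __init__(self, df, new_property):
        self._df = df
        self.new_property = new_property

    def __getattr__(self, attr):
        # Fallback for anything the wrapper does not define itself.
        if attr == "_df":
            raise AttributeError(attr)
        return getattr(self._df, attr)

    def __getitem__(self, item):
        return self._df[item]

    def append_row(self, values):
        # Hypothetical helper: add one row and return a new wrapper
        # that carries new_property over to the result.
        row = pd.DataFrame([values], columns=self._df.columns)
        new_df = pd.concat([self._df, row], ignore_index=True)
        return type(self)(new_df, self.new_property)


cols = ["A", "B", "C"]
wdf = WrappedDataFrame(pd.DataFrame([["x", "y", "z"]], columns=cols), cols[:2])
wdf = wdf.append_row(["a", "b", "c"])
print(wdf.new_property)
print(wdf.shape)  # shape is forwarded to the wrapped DataFrame
```

With this approach new_property survives the append because the wrapper, not pandas, decides how the new instance is constructed.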

Upvotes: 5
