Subclass a DataFrame without mutating original object

Question

As mentioned a while back by @piRSquared, subclassing a pandas DataFrame in the way suggested in the docs, or by geopandas' GeoDataFrame, opens up the possibility of unwanted mutation of the original object:

import numpy as np
import pandas as pd

class SubFrame(pd.DataFrame):

    def __init__(self, *args, **kwargs):
        attr = kwargs.pop('attr', None)
        super(SubFrame, self).__init__(*args, **kwargs)
        self.attr = attr

    @property
    def _constructor(self):
        return SubFrame

    def somefunc(self):
        """Add some extended functionality."""
        pass

df = pd.DataFrame([[1, 2], [3, 4]])
sf = SubFrame(df, attr=1)

sf[:] = np.nan  # Modifies `df`
print(df)

#     0   1
# 0 NaN NaN
# 1 NaN NaN

The error-prone "fix" is to pass a copy at instantiation:

sf = SubFrame(df.copy(), attr=1)

But this is easily subject to user error. My question is: can I create a copy of self (the passed DataFrame) within class SubFrame itself? How would I go about doing this?

If the answer is "No," I appreciate that too, so that I can scrap this endeavor before wasting time on it.

A polite request

The pandas docs suggest two alternatives:

Extensible method chains with pipe
Composition

I've already thoroughly considered both of these, so I'd appreciate if answers can refrain from an opinionated discussion with generic reasons about why these 2 alternatives are better/safer.

Alex Zisman · Accepted Answer

self is not the DataFrame you are passing; it is the new SubFrame instance being initialized. Regardless, you can perform the copy in the __init__ method.

For example:

import copy

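def __init__(self, farg, **kwargs):
    # Copy the incoming object so later edits cannot reach the caller's data.
    farg = copy.deepcopy(farg)
    attr = kwargs.pop('attr', None)
    # Forward the remaining keyword arguments instead of silently dropping them.
    super().__init__(farg, **kwargs)
    self.attr = attr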

This should show you that farg is the df you are passing, and that it is copied before the parent constructor ever sees it.

I don't know much about subclassing a DataFrame, so if you want to keep your original __init__ structure you can copy all of the *args instead. I can't speak to the safety of this approach.

def __init__(self, *args, **kwargs):
    cargs = tuple(copy.deepcopy(arg) for arg in args)
    attr = kwargs.pop('attr', None)
    super().__init__(*cargs, **kwargs)
    self.attr = attr
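
Putting the pieces together, here is a minimal sketch of the whole class with the copy done inside __init__. It is not taken from the pandas docs verbatim and rests on a few assumptions: the first positional parameter is named data (my choice), only DataFrame inputs are copied, and DataFrame.copy() is used in place of copy.deepcopy. The _metadata entry is the mechanism the pandas subclassing docs describe for carrying custom attributes over to derived frames. Other inputs, such as NumPy arrays, are passed through as-is and may still share memory with the caller depending on the pandas version.

```python
import pandas as pd


class SubFrame(pd.DataFrame):
    # Listing "attr" here asks pandas to propagate it to derived frames.
    _metadata = ["attr"]

    def __init__(self, data=None, *args, attr=None, **kwargs):
        # Assumption: only DataFrame inputs are copied; anything else
        # (lists, dicts, pandas internals) is handed to the parent unchanged.
        if isinstance(data, pd.DataFrame):
            data = data.copy()
        super().__init__(data, *args, **kwargs)
        self.attr = attr

    @property
    def _constructor(self):
        return SubFrame


df = pd.DataFrame([[1, 2], [3, 4]])
sf = SubFrame(df, attr=1)

sf.iloc[0, 0] = 99  # Writes to the copy held by `sf`, not to `df`
print(df)
```

One trade-off to be aware of: pandas calls _constructor internally when it builds results, so depending on the pandas version those internal calls may also pay for the extra copy.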
