Reputation: 1455
I am building a library for working with very specific structured data, and I am building my infrastructure on top of Pandas. Currently I am writing a bunch of different data containers for different use cases, such as CTMatrix for Country x Time data, to house methods appropriate for all Country x Time structured data.
I am currently debating between
Option 1: Object Inheritance
class CTMatrix(pd.DataFrame):
    # methods etc. here
or Option 2: Object Use
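class CTMatrix(object):
    def __init__(self, *args, **kw):
        # wrap a DataFrame instance rather than inheriting from it
        self._data = pd.DataFrame(*args, **kw)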
then use getter, setter methods to control access to _data etc.
From a software engineering perspective is there an obvious choice here?
My thoughts so far are:
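Option 1:
DataFrame methods can be called directly on the CTMatrix object (e.g. CTMatrix.sort()) without having to support them via methods on the encapsulated _data object, as in Option #2.
BUT
It requires overriding __init__() and having to pass the attributes up to the superclass: super(MyDF, self).__init__(*args, **kw)
Option 2:
But DataFrame methods have to be reached through the wrapped object (e.g. CTMatrix.data.sort()).
Are there any additional downsides for taking the approach in Option #1?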
Upvotes: 9
Views: 4805
Reputation: 1867
Because of similar issues and Matti John's answer I wrote a _pandas_wrapper class for a project of mine, because I also wanted to inherit from pandas DataFrame.
The only purpose of this class is to give a pandas DataFrame lookalike that is safe to inherit from.
If your project is LGPL licensed you can reuse it without problems.
Upvotes: 1
Reputation: 20477
I would avoid subclassing DataFrame, because many of the DataFrame methods will return a new DataFrame and not another instance of your CTMatrix object.
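A minimal sketch of that behaviour, assuming a reasonably recent pandas. The class name PlainCTMatrix and the sample data are made up for illustration. The second class overrides the `_constructor` property, which the pandas "Extending pandas" documentation describes as the hook for making methods return the subclass; whether that is enough for a given project is a separate question.

```python
import pandas as pd


class PlainCTMatrix(pd.DataFrame):
    """Naive subclass (hypothetical name): inherits DataFrame unchanged."""


class CTMatrix(pd.DataFrame):
    """Subclass that asks pandas to build results as CTMatrix."""

    @property
    def _constructor(self):
        # pandas consults this property when a method builds a new frame
        return CTMatrix


# Illustrative Country x Time data: rows are countries, columns are years
data = {"1990": [1.0, 2.0], "1991": [3.0, 4.0]}
index = ["AUS", "NZL"]

naive = PlainCTMatrix(data, index=index)
aware = CTMatrix(data, index=index)

naive_result = naive.head(1)  # comes back as a plain DataFrame
aware_result = aware.head(1)  # comes back as a CTMatrix

print(type(naive_result).__name__)
print(type(aware_result).__name__)
```

Any custom methods defined on the naive subclass are lost after the first DataFrame operation, which is the problem described above.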
There are a few open issues on GitHub around this.
More generally, this is a question of composition vs inheritance. I would be especially wary of benefit #2. It might seem great now, but unless you keep a close eye on updates to Pandas (and it is a fast-moving target), you can easily end up with unexpected consequences, and your code will become intertwined with Pandas.
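For comparison, here is a rough sketch of the composition route. The method names sort_by_country and change_over_time are hypothetical, chosen only to illustrate the idea, and the sample data is invented. The wrapper exposes a small, explicit set of operations and re-wraps every result, so callers always get a CTMatrix back and only the pandas calls you chose to rely on can affect you.

```python
import pandas as pd


class CTMatrix(object):
    """Country x Time container that wraps a DataFrame instead of subclassing it."""

    def __init__(self, *args, **kw):
        # Accept the same arguments as pd.DataFrame and keep the frame private
        self._data = pd.DataFrame(*args, **kw)

    @property
    def data(self):
        # Hand out a copy so callers cannot mutate the internal frame
        return self._data.copy()

    def sort_by_country(self):
        # Hypothetical method: sort rows by country label, re-wrap the result
        return CTMatrix(self._data.sort_index())

    def change_over_time(self):
        # Hypothetical method: difference between consecutive time columns
        return CTMatrix(self._data.diff(axis=1))


m = CTMatrix({"1990": [3.0, 1.0], "1991": [4.0, 2.0]}, index=["NZL", "AUS"])
sorted_m = m.sort_by_country()
delta = m.change_over_time()

print(type(sorted_m).__name__)
print(sorted_m.data)
print(delta.data)
```

The cost is that every DataFrame feature you want has to be exposed deliberately, which is the trade-off raised in the question.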
Upvotes: 5