JHixson

Reputation: 1522

What is the best way to extend the functionality of an existing class of an existing 3rd party library?

Let's say, for example, that I am using lots of pandas.DataFrames in my code. This is a really big class with lots of pieces, and it has a very functional-style API where you tend to chain its method calls together.

It feels like it would be nice to easily be able to add functionality to this existing class while still keeping the functional, flowing API intact.

I want to turn this class into something a bit more domain-specific, in the interest of removing a lot of domain-specific boilerplate code that comes up again and again when using this class.

Let's say that in my dreams I'd be able to do something like this:

pandas.read_csv('sales.csv') \
    .filter(items=['one', 'three']) \
    .apply(myTransformationFunction) \
    .saveToHivePartition(tablename = 'sales', partitionColumn = 'four') \
    .join(pandas.read_csv('employees.csv')) \
    .filter(items=['one','three','five']) \
    .saveToHivePartition(tablename = 'EmployeeMetrics', partitionColumn = 'ten')

In this example, saveToHivePartition is a custom method that does a lot of stuff to atomically save something to HDFS at the correct location and then adds that information to the Hive metastore. Super useful!

Obviously the "easy" answer is to simply create a standalone function that I can pass a DataFrame object into, along with the other parameters that I need. But then every time I want to perform the save, I need to capture the dataframe in a variable and pass that variable into a separate function. Nowhere near as clean as the above example!
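(For what it's worth, the standalone-function approach doesn't necessarily have to break the chain: pandas DataFrames have a `.pipe()` method that passes the frame into an arbitrary function mid-chain. A minimal sketch, where `save_to_hive_partition` is a hypothetical stub standing in for the real HDFS/Hive logic:

```python
import pandas as pd

def save_to_hive_partition(df, tablename, partitionColumn):
    # Hypothetical stand-in for the real save logic (write to HDFS,
    # register the partition in the Hive metastore, etc.).
    # Returning the frame unchanged is what keeps the chain alive.
    print(f"saving {len(df)} rows to {tablename} partitioned by {partitionColumn}")
    return df

result = (
    pd.DataFrame({'one': [1, 2], 'three': [3, 4], 'four': [5, 6]})
    .filter(items=['one', 'three'])
    .pipe(save_to_hive_partition, tablename='sales', partitionColumn='four')
    .filter(items=['one'])
)
```

`.pipe()` simply calls `save_to_hive_partition(df, tablename=..., partitionColumn=...)` and chains on whatever it returns, so no subclassing is needed at all.)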

So then the next thing that comes to mind is to create some cool new SuperDataFrame class that is a superset of DataFrame. The only issue with that approach is: how do you even construct it? I can't change pandas.read_csv() to start returning my new class. I can't cast a base class into a subclass. I could have SuperDataFrame be... a wrapper for the DataFrame class? But then I think that any time I want to call a base DataFrame function, it would have to be like this:

SuperDataFrame(pandas.read_csv('sales.csv')) \
    .df.filter .... \
    .df.apply .... \
    .saveToHivePartition .... \
    .df.join .... \
    .df.filter .... \
    .saveToHivePartition ....

I don't even think this works, because it assumes that whenever pandas performs an operation, it mutates the DataFrame in-place, which I'm pretty sure it doesn't even do.

Any ideas? Or is this just a bad thing?

Upvotes: 2

Views: 1020

Answers (2)

BrenBarn

Reputation: 251408

You can use the `_constructor` property override to ensure that pandas operations return an instance of your new class. A simple example:

class MyDF(pandas.DataFrame):
    @property
    def _constructor(self):
        return MyDF

    def myMethod(self):
        return "I am cool"

>>> d = MyDF([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["A", "B", "C"])
>>> d.filter(["A", "B"]).apply(lambda c: c**2).myMethod()
'I am cool'

Of course, you'll also have to make your own Series subclass and potentially define custom data slots if you also want to be able to handle Series.
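A sketch of what that Series handling can look like, using pandas' subclassing hooks (`_constructor_sliced` on the DataFrame subclass and `_constructor_expanddim` on the Series subclass); the class names here are just illustrative:

```python
import pandas

class MySeries(pandas.Series):
    @property
    def _constructor(self):
        # Series-returning Series operations stay MySeries
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Series -> DataFrame operations (e.g. to_frame) return MyDF
        return MyDF

class MyDF(pandas.DataFrame):
    @property
    def _constructor(self):
        # DataFrame-returning operations stay MyDF
        return MyDF

    @property
    def _constructor_sliced(self):
        # Single-row/column selections come back as MySeries
        return MySeries

    def myMethod(self):
        return "I am cool"
```

With both hooks in place, round trips like `df["A"].to_frame()` stay inside your subclasses instead of falling back to plain pandas types.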

That said, it's not clear to me that your example really warrants this. You can already chain method calls on ordinary DataFrames, so if all you want to do is save some intermediate stages, it's not that onerous to do:

data = pandas.read_csv('sales.csv') \
    .filter(items=['one', 'three', 'five']) \
    .apply(myTransformationFunction)

data2 = data.join(pandas.read_csv('employees.csv')) \
    .filter(items=['one', 'three'])

saveToHivePartition(data, tablename='sales', partitionColumn='four')
saveToHivePartition(data2, tablename='EmployeeMetrics', partitionColumn='ten')

(I changed the order of the filters due to the issue I mentioned in my comment on the question. It doesn't make sense to filter to a smaller subset first and a larger subset later; the larger subset won't be there later because some of it will already have been filtered out.)

The mere ability to write things as method chains isn't inherently an advantage. I would be especially dubious of chaining methods like your .saveToHivePartition, which presumably are called only for their side effects. It makes sense to chain methods when they're acting in assembly-line fashion, each taking the input of the previous one and modifying it to pass to the next. But having methods that just do a side effect and return the unchanged object doesn't do much to make the code more readable.

Also, note that this solution is specific to pandas. In general, if classes in some library create instances of one another, the library must be carefully designed to allow subclassing in a way that preserves the relationships among the classes. Pandas has done this with the constructor-override mechanism I described, but it wasn't always like that, and before that, it was difficult to do what you're trying to do.

Upvotes: 1

blue note

Reputation: 29081

One way would be to extend the DataFrame class. To keep the fluent interface, you could just create your own custom `read_csv`:

def read_csv(file):
    dataframe = pandas.read_csv(file)
    return SuperDataFrame(dataframe)

You start by calling that function, and the rest of the chained calls remain as they are. You just add the methods you want to your new class.
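A minimal sketch of the whole thing (the `_constructor` override keeps chained pandas calls returning the subclass, and the save body is a hypothetical stub):

```python
import pandas

class SuperDataFrame(pandas.DataFrame):
    @property
    def _constructor(self):
        # Without this, filter/join/etc. would return plain DataFrames
        # and the custom methods would be lost mid-chain.
        return SuperDataFrame

    def saveToHivePartition(self, tablename, partitionColumn):
        # Hypothetical stub for the real HDFS/Hive save logic;
        # returning self preserves the fluent interface.
        return self

def read_csv(file):
    # Entry point: wrap the plain DataFrame in the subclass.
    return SuperDataFrame(pandas.read_csv(file))
```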

Alternatively (python-specific hack), you can probably monkey-patch your own function directly into the original dataframe class in the module where you import it, but don't...
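For completeness, the monkey-patch being warned against is a one-liner along these lines (the method body is a hypothetical stub):

```python
import pandas

def saveToHivePartition(self, tablename, partitionColumn):
    # Hypothetical stub; returns self to preserve chaining.
    return self

# Every DataFrame in the process now grows this method, including ones
# created by third-party code -- which is exactly why it's discouraged.
pandas.DataFrame.saveToHivePartition = saveToHivePartition
```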

Upvotes: 0
