attach functions to pandas

Question

This may be a very basic question (and I can remove it if there are objections to it).

Suppose I have a function that I reuse multiple times in various projects:

def sort_clean(x, sort_cols):
    # DataFrame.sort was removed from pandas; sort_values is its replacement
    x.sort_values(sort_cols, inplace=True)
    x.reset_index(inplace=True, drop=True)

I want to make this part of the pandas module, so that whenever I import pandas and define a dataframe myDf, myDf.sort_clean is available as a method on that dataframe. Is this possible?

ely · Accepted Answer

You can subclass DataFrame:

import pandas

class NewDataFrame(pandas.DataFrame):
    def sort_clean(self, sort_cols):
        # sort_values replaces the removed DataFrame.sort
        self.sort_values(sort_cols, inplace=True)
        self.reset_index(inplace=True, drop=True)

For example:

In [25]: class NewDataFrame(pandas.DataFrame):
   ....:     def sort_clean(self, sort_cols):
   ....:         self.sort_values(sort_cols, inplace=True)
   ....:         self.reset_index(inplace=True, drop=True)
   ....:         

In [26]: dfrm
Out[26]: 
          A         B         C
0  0.382531  0.287066  0.345749
1  0.725201  0.450656  0.336720
2  0.146883  0.266518  0.011339
3  0.111154  0.190367  0.275750
4  0.757144  0.283361  0.736129
5  0.039405  0.643290  0.383777
6  0.632230  0.434664  0.094089
7  0.658512  0.368150  0.433340
8  0.062180  0.523572  0.505400
9  0.287539  0.899436  0.194938

[10 rows x 3 columns]

In [27]: my_df = NewDataFrame(dfrm) 

In [28]: my_df.sort_clean(["B", "C"])

In [29]: my_df
Out[29]: 
          A         B         C
0  0.111154  0.190367  0.275750
1  0.146883  0.266518  0.011339
2  0.757144  0.283361  0.736129
3  0.382531  0.287066  0.345749
4  0.658512  0.368150  0.433340
5  0.632230  0.434664  0.094089
6  0.725201  0.450656  0.336720
7  0.062180  0.523572  0.505400
8  0.039405  0.643290  0.383777
9  0.287539  0.899436  0.194938

[10 rows x 3 columns]

But be aware that methods which return a new DataFrame (for example, my_df.head() or arithmetic on the frame) return a plain DataFrame, not a NewDataFrame, unless the subclass overrides the _constructor property.
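As a minimal sketch of keeping the subclass type across operations: pandas consults the _constructor property when a method has to build a new frame, so overriding it to return the subclass makes derived frames keep sort_clean. The sample data and the name scaled are only for illustration, and sort_values is used here on the assumption of a current pandas version.

```python
import pandas as pd


class NewDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # pandas uses this to build the result of operations on the frame,
        # so returning the subclass keeps the custom methods available.
        return NewDataFrame

    def sort_clean(self, sort_cols):
        # Sort in place, then replace the shuffled index with 0..n-1.
        self.sort_values(sort_cols, inplace=True)
        self.reset_index(inplace=True, drop=True)


# Illustrative data, not taken from the session above.
my_df = NewDataFrame({"A": [3, 1, 2], "B": [0.3, 0.1, 0.2]})

scaled = my_df * 2          # arithmetic builds a new frame via _constructor
scaled.sort_clean(["B"])    # works because scaled is still a NewDataFrame
```

Without the _constructor override, scaled would be a plain DataFrame and the call to sort_clean would raise AttributeError.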

Normal monkey-patching (e.g. just creating a new attribute on an existing DataFrame instance, like df.sort_clean = sort_clean) is tricky: a function stored on an instance is not bound to it, so the instance is not supplied as the implicit first argument, and this function needs it because it mutates the DataFrame in place. You'd have to use functools.partial, or a lambda with a default, on every instance to achieve the same thing:

df.sort_clean = lambda sort_cols, x=df: sort_clean(x, sort_cols)

Note that with the lambda approach, the argument that has the default must come last (in Python, parameters with default values must follow parameters without them). You can get around this by using functools.partial instead:

import functools
df.sort_clean = functools.partial(sort_clean, df)
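Put together as a runnable sketch (the sample data and the names other and has_helper are illustrative, and sort_values stands in for the older sort on the assumption of a current pandas version), the partial-based patch looks like this. It also shows the main limitation: the attribute exists only on the one instance that was patched.

```python
import functools

import pandas as pd


def sort_clean(x, sort_cols):
    # Sort in place, then replace the shuffled index with 0..n-1.
    x.sort_values(sort_cols, inplace=True)
    x.reset_index(inplace=True, drop=True)


df = pd.DataFrame({"A": [1, 2, 3], "B": [0.3, 0.1, 0.2]})

# Freeze df as the first argument so the call site looks like a method call.
df.sort_clean = functools.partial(sort_clean, df)
df.sort_clean(["B"])

# A separately created DataFrame knows nothing about the patched attribute.
other = pd.DataFrame({"B": [2, 1]})
has_helper = hasattr(other, "sort_clean")
```

Because the patch has to be repeated for every DataFrame you create, this is usually less convenient than the subclass.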
