Lucidnonsense

Reputation: 1243

Overloading in python - pandas

I'm building a database-like object which, when an index is not found, uses an API to retrieve the information, saves it to the object/file, and returns it.

I'd like to do this by overloading the .loc[x, y] method of the pandas DataFrame but I can't work out how to do this!

At the moment I have:

import pandas as pd
pd.set_option('io.hdf.default_format','table')

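class DataBase(pd.DataFrame):
    """DataBase object which can be updated by an external API."""
    # Declare `api` as metadata so pandas does not treat it as a column.
    _metadata = ['api']

    def __init__(self, path, api=None):
        # Read from the given path rather than a hard-coded file name.
        super(DataBase, self).__init__(pd.read_hdf(path, 'df'))
        self.api = api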

I may want to change the __init__ function to include a where argument so I can read only what I need to.

I can't think of a way to overload the .loc method properly!

Also, HDF5 is just one storage option. I'd like to retain the ability to use other storage methods such as SQL, or even CSV if necessary.

Upvotes: 2

Views: 2248

Answers (2)

kwekL

Reputation: 33

Adding to the other answer years later: if you ever end up subclassing the basic pandas classes, you can override some constructor properties to ensure that the new class is preserved through the standard pandas manipulations. From the pandas internals documentation:

  • _constructor: Used when a manipulation result has the same dimensions as the original.
  • _constructor_sliced: Used when a manipulation result has one lower dimension than the original, such as slicing a single column out of a DataFrame.
  • _constructor_expanddim: Used when a manipulation result has one higher dimension than the original, such as Series.to_frame() and DataFrame.to_panel() (in pandas versions that still have Panel).

e.g.:

@property
def _constructor(self):
    return MyDataFrame 
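
For context, here is a slightly fuller sketch of how that property sits inside a subclass. The class name MyDataFrame comes from the snippet above; the `api` attribute and the sample data are only illustrative assumptions. Listing a name in `_metadata` is the documented way to have pandas carry a custom attribute over to results.

```python
import pandas as pd


class MyDataFrame(pd.DataFrame):
    # Attribute names listed here are propagated to results of pandas operations.
    # `api` is an illustrative name, not something pandas defines.
    _metadata = ["api"]

    @property
    def _constructor(self):
        # Results with the same dimensions come back as MyDataFrame, not DataFrame.
        return MyDataFrame


df = MyDataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.api = "client"      # stored as an attribute, not as a column
subset = df[["a"]]     # column selection goes through _constructor
```

Note that pandas passes data to whatever `_constructor` returns, so a subclass with a custom `__init__` signature needs to accept that as well.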

Upvotes: 1

Dunes

Reputation: 40713

loc is a property that returns an attribute called _loc if it is not None; otherwise it creates a pandas.core.indexing._LocIndexer on demand. Indexers, by default, have access to the DataFrame that created them, so you can modify the DataFrame on a key miss.

You can override the behaviour of DataFrame.loc by subclassing DataFrame and _LocIndexer as follows.

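from pandas import DataFrame
# Private pandas API; it may differ between pandas versions.
from pandas.core.indexing import _LocIndexer

class MyLocIndexer(_LocIndexer):
    def __getitem__(self, key):
        try:
            return super().__getitem__(key)
        except KeyError:
            # `db` stands in for whatever object talks to the external API.
            item = db.fetch_item(key)
            self[key] = item
            return item
            # `return self[key]` is better as it also works when accessing a
            # whole axis

class MyDataFrame(DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Pre-populate the private `_loc` attribute that the loc property returns.
        self._loc = MyLocIndexer(self, "loc")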

The above is written in Python 3, so you will have to fix the super() calls if you are using Python 2.
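
Since _loc and _LocIndexer are private pandas internals, an alternative that uses only the public API is to wrap a DataFrame instead of subclassing it. This is a different technique from the subclassing shown above, offered as a minimal sketch: the names DataBase, _FetchingLoc, frame and fetch are all illustrative, and fetch(row, col) is a hypothetical callable standing in for the external API. Only scalar (row, column) lookups are handled here.

```python
import pandas as pd


class _FetchingLoc:
    """Forwards to DataFrame.loc and fills in the value on a key miss."""

    def __init__(self, owner):
        self._owner = owner

    def __getitem__(self, key):
        df = self._owner.frame
        try:
            return df.loc[key]
        except KeyError:
            # This sketch only knows how to fetch a single (row, column) cell.
            if not (isinstance(key, tuple) and len(key) == 2):
                raise
            row, col = key
            value = self._owner.fetch(row, col)  # hypothetical API call
            df.loc[row, col] = value  # setting via .loc adds the missing label
            return value


class DataBase:
    """Wraps a DataFrame; `fetch` is called for cells that are not stored yet."""

    def __init__(self, frame, fetch):
        self.frame = frame
        self.fetch = fetch

    @property
    def loc(self):
        return _FetchingLoc(self)


# Usage with a stand-in fetch function (a real one would call the API):
# db = DataBase(pd.DataFrame({"price": [1.0]}, index=["a"]), fetch=my_fetch)
# db.loc["b", "price"]  # miss: calls my_fetch("b", "price") and stores the result
```

Because the wrapper only needs a DataFrame, how that frame is loaded and saved (HDF5, SQL, CSV) can be decided separately from the lookup logic.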

Upvotes: 4
