Clade
Clade

Reputation: 986

Is it appropriate to use a class for the purpose of organizing functions that share inputs?

To provide a bit of context, I am building a risk model that pulls data from various different sources. Initially I wrote the model as a single function that when executed read in the different data sources as pandas.DataFrame objects and used those objects when necessary. As the model grew in complexity, it quickly became unreadable and I found myself copy an pasting blocks of code often.

To cleanup the code I decided to make a class that when initialized reads, cleans and parses the data. Initialization takes about a minute to run and builds my model in its entirety.

The class also has some additional functionality. There is a generate_email method that sends an email with details about high risk factors and another method append_history that point-in-times the risk model and saves it so I can run time comparisons.

The thing about these two additional methods is that I cannot imagine a scenario where I would call them without first re-calibrating my risk model. So I have considered calling them in init() like my other methods. I haven't only because I am trying to justify having a class in the first place.

I am consulting this community because my project structure feels clunky and awkward. I am inclined to believe that I should not be using a class at all. Is it frowned upon to create classes merely for the purpose of organization? Also, is it bad practice to call instance methods (that take upwards of a minute to run) within init()?

Ultimately, I am looking for reassurance or a better code structure. Any help would be greatly appreciated.

Here is some pseudo code showing my project structure:

class RiskModel:
    def __init__(self, data_path_a, data_path_b):
        self.data_path_a = data_path_a
        self.data_path_b = data_path_b

        self.historical_data = None
        self.raw_data = None
        self.lookup_table = None
        self._read_in_data()

        self.risk_breakdown = None
        self._generate_risk_breakdown()

        self.risk_summary = None
        self.generate_risk_summary()

    def _read_in_data(self):
        # read in a .csv
        self.historical_data = pd.read_csv(self.data_path_a)

        # read an excel file containing many sheets into an ordered dictionary
        self.raw_data = pd.read_excel(self.data_path_b, sheet_name=None)


        # store a specific sheet from the excel file that is used by most of
        # my class's methods
        self.lookup_table = self.raw_data["Lookup"]

    def _generate_risk_breakdown(self):
        '''
        A function that creates a DataFrame from self.historical_data,
        self.raw_data, and self.lookup_table and stores it in 
        self.risk_breakdown
        '''
        self.risk_breakdown = some_dataframe

    def _generate_risk_summary(self):
        '''
        A function that creates a DataFrame from self.lookup_table and 
        self.risk_breakdown and stores it in self.risk_summary
        '''
        self.risk_summary = some_dataframe

    def generate_email(self, recipient):
        '''
        A function that sends an email with details about high risk factors
        '''

if __name__ == "__main__":
    risk_model = RiskModel(data_path_a, data_path_b)
    risk_model.generate_email([email protected])

Upvotes: 1

Views: 187

Answers (1)

Tim Mironov
Tim Mironov

Reputation: 859

In my opinion it is a good way to organize your project, especially since you mentioned the high rate of re-usability of parts of the code.

One thing though, I wouldn't put the _read_in_data, _generate_risk_breakdown and _generate_risk_summary methods inside __init__, but instead let the user call this methods after initializing the RiskModel class instance.

This way the user would be able to read in data from a different path or only to generate the risk breakdown or summary, without reading in the data once again.

Something like this:

my_risk_model = RiskModel()
my_risk_model.read_in_data(path_a, path_b)
my_risk_model.generate_risk_breakdown(parameters)
my_risk_model.generate_risk_summary(other_parameters)

If there is an issue of user calling these methods in an order which would break the logical chain, you could throw an exception if generate_risk_breakdown or generate_risk_summary are called before read_in_data. Of course you could only move the generate... methods out, leaving the data import inside __init__.

To advocate more on exposing the generate... methods out of __init__, consider a case scenario, where you would like to generate multiple risk summaries, changing various parameters. It would make sense, not to create the RiskModel every time and read the same data, but instead change the input to generate_risk_summary method:

my_risk_model = RiskModel()
my_risk_model.read_in_data(path_a, path_b)

for parameter in [50, 60, 80]:
    my_risk_model.generate_risk_summary(parameter)
    my_risk_model.generate_email('[email protected]')

Upvotes: 1

Related Questions