Freek

Reputation: 1189

Adding an attribute (function) to pandas.DataFrame() as part of a class

I have the feeling this must have been asked before but I might lack the vocabulary to search and describe my problem.

I have made a Python 3 class that takes a directory as input and scrapes a lot of data together into a pandas.DataFrame, so that I can do this:

mymodule.myclass('/some/dir').get_tpm_values()

And get a pd.DataFrame with some columns and rows, like this:

>>> seqit.Seqrun(41).get_tpm_values()
                 0041_P2017BB2S5R_S1  0041_P2017BB2S3R_S2  0041_P2017BB2S4R_S3  0041_P2017BB2S8R_S4  0041_P2017BB5S10R_S5
gene_id                                                                                                                  
ENSG00000000003                53.72                19.31                11.03                33.35                 14.55
ENSG00000000005                 1.05                 0.34                 0.19                 0.84                  0.12
ENSG00000000419                13.35                12.66                11.93                17.61                 22.82

Now this DataFrame is a special DataFrame: it always contains genes in the index and samples as the columns. As such, I can write methods that act on the returned DataFrame but would not make sense on an arbitrary DataFrame. For example, I'd like to be able to add Hugo symbols to the index and save to Excel like this:

mymodule.myclass('/some/dir').get_tpm_values().add_hugo_symbols_to_index().to_excel('some_excel.xlsx')

This would mean I need to add attributes (methods) to pandas, but only inside my class. How would I do that?

Edit: it may be helpful to post part of my class:

import os

import numpy as np
import pandas as pd


class Myclass():
    """
    A class that gives one a handle on a Snakemake sequencing data analysis
    folder
    """
    def __init__(self, seqrun_dir):
        if isinstance(seqrun_dir, int):
            self.seqrun_dir = self.number2seqrun(seqrun_dir)
        else:
            self.seqrun_dir = seqrun_dir   
        self.name = os.path.split(self.seqrun_dir)[-1]
        self.quantification_data_loaded = False
        self.pctpm_values_loaded = False
        self.load_sample_table()

    def get_tpm_values(self):
        """
        Get a pd.DataFrame with the TPM values from loaded quantification_data dictionary
        """
        if not self.quantification_data_loaded:
            self.get_quantification_data()
        self.tpm_values = dict()
        for sample in self.samples:
            try:
                self.tpm_values[sample] = self.quantification_data[sample]['TPM']
            except KeyError:
                print('Filling column', sample, 'with NaNs')
                self.tpm_values[sample] = np.nan
        self.tpm_values = pd.DataFrame(self.tpm_values)
        self.tpm_values_loaded = True
        return self.tpm_values

Upvotes: 0

Views: 1456

Answers (1)

Cleared

Reputation: 2590

If I understand your question correctly, you want to add a method to the DataFrame class. A reference for this can be found here.

In my opinion, the best way to solve this is to create your own DataFrame class that inherits from pandas.DataFrame and implements the additional method. See the code below for an example:

class HugoDataFrame(pd.DataFrame):
    def add_hugo_symbols_to_index(self):
        pass  # Do your stuff here

And then, instead of creating a DataFrame and returning it, you should create a HugoDataFrame like this:

self.tpm_values = HugoDataFrame(self.tpm_values)
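
One caveat with subclassing: many pandas operations (copying, slicing, and so on) hand back a plain DataFrame unless the subclass overrides `_constructor`. Below is a minimal sketch of that, assuming a hypothetical `symbols` dict that maps Ensembl gene IDs to Hugo symbols; the sample names and the `ID|symbol` index format are made up for illustration and are not from your code:

```python
import pandas as pd


class HugoDataFrame(pd.DataFrame):
    """DataFrame subclass with a gene-specific helper method."""

    @property
    def _constructor(self):
        # Make pandas operations (copy, slicing, ...) return a HugoDataFrame
        # instead of falling back to a plain pd.DataFrame.
        return HugoDataFrame

    def add_hugo_symbols_to_index(self, symbols):
        # `symbols` is a hypothetical dict: Ensembl gene ID -> Hugo symbol.
        out = self.copy()
        out.index = [f"{gene_id}|{symbols.get(gene_id, 'NA')}" for gene_id in out.index]
        out.index.name = self.index.name
        return out


# Illustrative data only.
tpm = HugoDataFrame(
    {"sample_1": [53.72, 1.05], "sample_2": [19.31, 0.34]},
    index=pd.Index(["ENSG00000000003", "ENSG00000000005"], name="gene_id"),
)
hugo = {"ENSG00000000003": "TSPAN6", "ENSG00000000005": "TNMD"}
annotated = tpm.add_hugo_symbols_to_index(hugo)
```

Because the method returns a new object, the chained `.add_hugo_symbols_to_index(...).to_excel(...)` call from the question should work on the result.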

Your other option would be to simply move this functionality into a separate function that takes a DataFrame and returns the modified one. So instead of

mymodule.myclass('/some/dir').get_tpm_values().add_hugo_symbols_to_index().to_excel('some_excel.xlsx')

you call

add_hugo_symbols_to_index(mymodule.myclass('/some/dir').get_tpm_values()).to_excel('some_excel.xlsx')
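
If you prefer to keep the left-to-right chaining with a plain function, `DataFrame.pipe` passes the DataFrame as the first argument to any callable. A rough sketch, again assuming a hypothetical `symbols` dict (Ensembl gene ID to Hugo symbol) and a made-up `ID|symbol` index format:

```python
import pandas as pd


def add_hugo_symbols_to_index(df, symbols):
    # `symbols` is a hypothetical dict: Ensembl gene ID -> Hugo symbol.
    # rename() returns a new DataFrame and leaves `df` untouched.
    return df.rename(index=lambda gene_id: f"{gene_id}|{symbols.get(gene_id, 'NA')}")


# Illustrative data only.
tpm = pd.DataFrame(
    {"sample_1": [53.72, 1.05], "sample_2": [19.31, 0.34]},
    index=pd.Index(["ENSG00000000003", "ENSG00000000005"], name="gene_id"),
)
hugo = {"ENSG00000000003": "TSPAN6", "ENSG00000000005": "TNMD"}

# pipe() calls add_hugo_symbols_to_index(tpm, symbols=hugo).
result = tpm.pipe(add_hugo_symbols_to_index, symbols=hugo)
```

In your case that would read something like `get_tpm_values().pipe(add_hugo_symbols_to_index, symbols=...).to_excel('some_excel.xlsx')`, with no subclass needed.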

Upvotes: 1
