In place change or overwrite dataframe?

Question

I'm working on a pipeline of manipulations on a pandas dataframe inside a class, and I'm wondering what the best practice is for concatenating some procedures one after the other: should I copy and recreate the original dataframe at each step, or just change it in place?

According to the pandas documentation, working with views is not always recommended, and I'm not sure if this is the case here.

For example:

# first approach
import pandas as pd

class MyDataframe:
  def __init__(self):
    data = [['alice', 7], ['bob', 15], ['carol', 2]]
    self.df = pd.DataFrame(data, columns = ['Name', 'Age'])
    self.df_with_double_age_col = self._double_age()
    self.df_with_double_age_col_and_first_letter = self._get_first_letter_of_name()

  def _double_age(self):
    self.df['double_age'] = self.df['Age'] * 2
    return self.df

  def _get_first_letter_of_name(self):
    self.df['first_letter'] = self.df['Name'].str[0]
    return self.df

In the preceding implementation, self.df ends up equal to both self.df_with_double_age_col and self.df_with_double_age_col_and_first_letter.

You can see this by running:

my_df = MyDataframe()
print(my_df.df.equals(my_df.df_with_double_age_col)) # True

I don't think that's a good situation (one thing with several names). I'll get a lot of aliasing if I add more processing steps. On the other hand, I'm wary of keeping only one dataframe and overwriting it at each step.

So, do you see any problem with this setup? I've included two alternative implementations below.

Overwrite (second approach):

import pandas as pd

class MyDataframe:
  def __init__(self):
    data = [['alice', 7], ['bob', 15], ['carol', 2]]
    self.df = pd.DataFrame(data, columns = ['Name', 'Age'])
    self.df = self._double_age()
    self.df = self._get_first_letter_of_name()

  def _double_age(self):
    self.df['double_age'] = self.df['Age'] * 2
    return self.df

  def _get_first_letter_of_name(self):
    self.df['first_letter'] = self.df['Name'].str[0]
    return self.df

In-place change (without `return`s, third approach):

import pandas as pd

class MyDataframe:
  def __init__(self):
    data = [['alice', 7], ['bob', 15], ['carol', 2]]
    self.df = pd.DataFrame(data, columns = ['Name', 'Age'])
    self._double_age()
    self._get_first_letter_of_name()

  def _double_age(self):
    self.df['double_age'] = self.df['Age'] * 2
    
  def _get_first_letter_of_name(self):
    self.df['first_letter'] = self.df['Name'].str[0]

I believe that the last option is the most compact and elegant, but changing the given dataframe may be inconvenient and risky (in the context of SettingWithCopyWarning).

idow09 · Accepted Answer

Actually, all three of your implementations mutate your class's df in place, so they are practically equivalent in their effect on the class. The difference is that in the first and second implementations you also return the mutated df after mutating it. In the first implementation you assign it to different class attributes (aliases, as you call them), and in the second you assign it back to self.df (which has no effect).

So, if anything, the third implementation is the most valid one.

Nonetheless, if you want a more functional syntax, which one could argue is the pandas way "for concatenating some procedures one after the other" as you put it, you can use functions instead of methods and pass self.df to them as a parameter. Then you can assign the results to new columns in self.df.
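
To make the point about in-place mutation concrete, here is a minimal sketch outside of any class. The standalone function `double_age` and the variable names are only for illustration; it mirrors what `_double_age` does in the first two approaches.

```python
import pandas as pd

df = pd.DataFrame([['alice', 7], ['bob', 15], ['carol', 2]],
                  columns=['Name', 'Age'])

def double_age(frame):
    # Adds a column to the frame that was passed in,
    # then returns that very same object (no copy is made).
    frame['double_age'] = frame['Age'] * 2
    return frame

result = double_age(df)

# Identity check: `result` is just another name for `df`.
same_object = result is df
# The "original" frame already carries the new column.
original_was_mutated = 'double_age' in df.columns

print(same_object, original_was_mutated)
```

Both values come out as True, which is why the extra attributes in the first approach are aliases rather than snapshots of intermediate states.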

For example:

import pandas as pd

class MyDataframe:
  def __init__(self):
    data = [['alice', 7], ['bob', 15], ['carol', 2]]
    self.df = pd.DataFrame(data, columns = ['Name', 'Age'])
    self.df['double_age'] = _double_age(self.df)
    self.df['first_letter'] = _get_first_letter_of_name(self.df)

def _double_age(df):
  return df['Age'] * 2

def _get_first_letter_of_name(df):
  return df['Name'].str[0]

Or in case you prefer functions that always return a new DataFrame (so you have access to intermediate states):

import pandas as pd

class MyDataframe:
  def __init__(self):
    data = [['alice', 7], ['bob', 15], ['carol', 2]]
    self.df = pd.DataFrame(data, columns = ['Name', 'Age'])
    self.df_with_double_age_col = _double_age(self.df)
    self.df_with_double_age_col_and_first_letter = _get_first_letter_of_name(self.df_with_double_age_col)

def _double_age(df):
  new_df = df.copy()
  new_df['double_age'] = new_df['Age'] * 2
  return new_df

def _get_first_letter_of_name(df):
  new_df = df.copy()
  new_df['first_letter'] = new_df['Name'].str[0]
  return new_df
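
As a variation on the copy-based version, the same steps can probably be written more compactly with `DataFrame.assign` (which returns a new DataFrame with the extra column) and `DataFrame.pipe` (which passes the frame to a function). This is a minimal sketch rather than the only way to do it; the function names `add_double_age` and `add_first_letter` are hypothetical.

```python
import pandas as pd

def add_double_age(df):
    # assign() returns a new DataFrame; the input frame is left untouched
    return df.assign(double_age=df['Age'] * 2)

def add_first_letter(df):
    return df.assign(first_letter=df['Name'].str[0])

raw = pd.DataFrame([['alice', 7], ['bob', 15], ['carol', 2]],
                   columns=['Name', 'Age'])

# Each pipe() call feeds the previous result into the next function
processed = raw.pipe(add_double_age).pipe(add_first_letter)

print(list(raw.columns))
print(list(processed.columns))
```

Here `raw` keeps only its two original columns, while `processed` has all four. If you need an intermediate state, you can stop the chain early and keep that result in its own variable.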

Note that in both of my examples the functions are defined at module level. Of course, you can choose to make them methods instead, but then you will probably want to decorate them with @staticmethod.
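
For completeness, a sketch of what the copy-returning version might look like with the helpers moved into the class as static methods. It assumes the same data and attribute names as above; the helpers take the frame as an explicit argument and never touch `self`.

```python
import pandas as pd

class MyDataframe:
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = pd.DataFrame(data, columns=['Name', 'Age'])
        # Each helper returns a modified copy, so every attribute
        # holds a distinct intermediate state.
        self.df_with_double_age_col = self._double_age(self.df)
        self.df_with_double_age_col_and_first_letter = (
            self._get_first_letter_of_name(self.df_with_double_age_col)
        )

    @staticmethod
    def _double_age(df):
        new_df = df.copy()
        new_df['double_age'] = new_df['Age'] * 2
        return new_df

    @staticmethod
    def _get_first_letter_of_name(df):
        new_df = df.copy()
        new_df['first_letter'] = new_df['Name'].str[0]
        return new_df

my_df = MyDataframe()
print(list(my_df.df.columns))
print(list(my_df.df_with_double_age_col_and_first_letter.columns))
```

Unlike the first approach in the question, the three attributes here are separate objects, so `my_df.df` still has only the Name and Age columns.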
