Reputation: 1032
I'm building a pipeline of manipulations on a pandas DataFrame inside a class, and I'm wondering what the best practice is for concatenating some procedures one after the other: should I copy and recreate the original DataFrame at each step, or just change it in place?
According to the pandas documentation, working with views is not always recommended, and I'm not sure whether that applies here.
For example:
# first approach
import pandas as pd

class MyDataframe:
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = pd.DataFrame(data, columns=['Name', 'Age'])
        self.df_with_double_age_col = self._double_age()
        self.df_with_double_age_col_and_first_letter = self._get_first_letter_of_name()

    def _double_age(self):
        self.df['double_age'] = self.df['Age'] * 2
        return self.df

    def _get_first_letter_of_name(self):
        self.df['first_letter'] = self.df['Name'].str[0]
        return self.df
In the preceding implementation, self.df is equal to self.df_with_double_age_col and also to self.df_with_double_age_col_and_first_letter.
You can see this by executing:
my_df = MyDataframe()
print(my_df.df.equals(my_df.df_with_double_age_col)) # True
I don't think that's a good situation (one thing with several names), and I'll get a lot of aliasing if I add more processing steps. On the other hand, I'm also wary of using only one DataFrame and overwriting it at each step.
So, do you see any problem with this approach? I've included two alternative implementations below.
# second approach
import pandas as pd

class MyDataframe:
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = pd.DataFrame(data, columns=['Name', 'Age'])
        self.df = self._double_age()
        self.df = self._get_first_letter_of_name()

    def _double_age(self):
        self.df['double_age'] = self.df['Age'] * 2
        return self.df

    def _get_first_letter_of_name(self):
        self.df['first_letter'] = self.df['Name'].str[0]
        return self.df
# third approach (no return values)
import pandas as pd

class MyDataframe:
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = pd.DataFrame(data, columns=['Name', 'Age'])
        self._double_age()
        self._get_first_letter_of_name()

    def _double_age(self):
        self.df['double_age'] = self.df['Age'] * 2

    def _get_first_letter_of_name(self):
        self.df['first_letter'] = self.df['Name'].str[0]
I believe that the last option is the most compact and elegant, but changing the given DataFrame may be inconvenient and risky (in the context of SettingWithCopyWarning).
Upvotes: 3
Views: 3626
Reputation: 3007
I personally don't like modifying data in place. It makes me worry about what might happen if the method gets called more than once, if the program crashes after some of the in-place modifications, or if other parts of the program are referencing the same table object.
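To make that worry concrete, here is a minimal sketch. The `double_age_in_place` helper and the `other_reference` name are hypothetical, not taken from the question; the point is only that an in-place step compounds when it runs twice, and that every other name bound to the same object sees the change.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['alice', 'bob', 'carol'], 'Age': [7, 15, 2]})
other_reference = df  # a second name for the same object, not a copy


def double_age_in_place(frame):
    # Hypothetical in-place step: overwrites the existing 'Age' column.
    frame['Age'] = frame['Age'] * 2


double_age_in_place(df)
double_age_in_place(df)  # an accidental second call compounds the change

ages_after_two_calls = df['Age'].tolist()
seen_through_other_reference = other_reference['Age'].tolist()
```

After the second call the ages have been doubled twice, and `other_reference` shows the same values because it never was a separate table.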
I'd probably solve the task using either .pipe or .assign.
Making only small changes:
import pandas as pd

class MyDataframe:
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = (pd
            .DataFrame(data, columns=['Name', 'Age'])
            .pipe(self._double_age)
            .pipe(self._get_first_letter_of_name)
        )

    # .pipe passes the DataFrame as the first argument, so each step
    # takes it as a parameter instead of reading self.df (which is not
    # set yet while the chain is running).
    def _double_age(self, df):
        return df.assign(double_age=lambda x: x['Age'] * 2)

    def _get_first_letter_of_name(self, df):
        return df.assign(first_letter=lambda x: x['Name'].str[0])
That said, you could also do this:
import pandas as pd

class MyDataframe:
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = (pd
            .DataFrame(data, columns=['Name', 'Age'])
            .assign(double_age=lambda x: x['Age'] * 2)
            .assign(first_letter=lambda x: x['Name'].str[0])
        )
Unless you want to be a savage, in which case:
import pandas as pd

class MyDataframe:
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = (pd
            .DataFrame(data, columns=['Name', 'Age'])
            .assign(**{
                'double_age': lambda x: x['Age'] * 2,
                'first_letter': lambda x: x['Name'].str[0]
            })
        )
If this were my codebase, I'd probably end up with:
import pandas as pd

# The column must be 'Name' (capitalized) to match the lookup in add_features.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Age': [7, 15, 2]})

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    ans = df.assign(**{
        'double_age': lambda x: x['Age'] * 2,
        'first_letter': lambda x: x['Name'].str[0]
    })
    return ans

df.pipe(add_features)
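As a rough check of why this style feels safer, the sketch below (the variable names `raw`, `featured` and `again` are mine, chosen for illustration) shows that `assign` returns a new DataFrame, so the input is left untouched and running the step twice yields the same result.

```python
import pandas as pd

raw = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Age': [7, 15, 2]})


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    # assign returns a new DataFrame; the input is not modified.
    return df.assign(
        double_age=lambda x: x['Age'] * 2,
        first_letter=lambda x: x['Name'].str[0],
    )


featured = raw.pipe(add_features)
again = raw.pipe(add_features)  # a repeated call gives the same result

raw_columns = list(raw.columns)
featured_columns = list(featured.columns)
```

Here `raw` still has only its two original columns, while `featured` and `again` each carry the two new ones.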
Upvotes: 1
Reputation: 514
Actually, all three of your implementations mutate your class's df in place, so they are practically equivalent in terms of their effect on the class. The difference is that in the first and second implementations, after mutating in place, you also return the mutated df. In the first implementation you assign it to different class attributes (aliases, as you call them), and in the second implementation you assign it back to itself (which has no effect).
So, if anything, the third implementation is the most reasonable one.
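A condensed sketch of the first approach makes the aliasing visible. The `is` check is my addition: unlike `.equals`, which compares contents, it confirms that both attributes point at the very same object.

```python
import pandas as pd


class MyDataframe:
    # Condensed version of the first approach from the question.
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = pd.DataFrame(data, columns=['Name', 'Age'])
        self.df_with_double_age_col = self._double_age()

    def _double_age(self):
        self.df['double_age'] = self.df['Age'] * 2
        return self.df  # returns the same object, not a copy


my_df = MyDataframe()
same_object = my_df.df is my_df.df_with_double_age_col
```

Since `same_object` is true, any later column added through one name also shows up under the other.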
Nonetheless, if you want a more functional syntax, which one could argue is the pandas way "for concatenating some procedures one after the other" as you stated, you can use functions instead of methods and pass self.df to them as a parameter. Then you can assign the results to new columns in your self.df.
For example:
import pandas as pd

class MyDataframe:
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = pd.DataFrame(data, columns=['Name', 'Age'])
        self.df['double_age'] = _double_age(self.df)
        self.df['first_letter'] = _get_first_letter_of_name(self.df)

def _double_age(df):
    return df['Age'] * 2

def _get_first_letter_of_name(df):
    return df['Name'].str[0]
Or, in case you prefer functions that always return a new DataFrame (so you have access to intermediate states):
import pandas as pd

class MyDataframe:
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = pd.DataFrame(data, columns=['Name', 'Age'])
        self.df_with_double_age_col = _double_age(self.df)
        self.df_with_double_age_col_and_first_letter = _get_first_letter_of_name(self.df_with_double_age_col)

def _double_age(df):
    new_df = df.copy()
    new_df['double_age'] = new_df['Age'] * 2
    return new_df

def _get_first_letter_of_name(df):
    new_df = df.copy()
    new_df['first_letter'] = new_df['Name'].str[0]
    return new_df
Note that in both examples the functions are module-level. Of course, you can choose to make them methods, but then you will probably want to decorate them with @staticmethod.
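A minimal sketch of that variant, assuming the same column-returning helpers as in the first example; only the placement inside the class and the decorator are new:

```python
import pandas as pd


class MyDataframe:
    def __init__(self):
        data = [['alice', 7], ['bob', 15], ['carol', 2]]
        self.df = pd.DataFrame(data, columns=['Name', 'Age'])
        self.df['double_age'] = self._double_age(self.df)
        self.df['first_letter'] = self._get_first_letter_of_name(self.df)

    @staticmethod
    def _double_age(df):
        # No self parameter: the helper only depends on the frame passed in.
        return df['Age'] * 2

    @staticmethod
    def _get_first_letter_of_name(df):
        return df['Name'].str[0]


my_df = MyDataframe()
```

The helpers stay free of instance state, so they can be tested on any DataFrame that has the expected columns.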
Upvotes: 3