Reputation: 11
I discovered method chaining in pandas only very recently. I love how it makes the code cleaner and more readable, but I still can't figure out how to use it when I want to modify only a single column, or a group of columns, as part of the pipeline.
For example, let's say this is my DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})
And I have some manipulation I want to do on it:
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
df[numeric_cols] = df[numeric_cols] * 2
df['str_2'] = df['str_2'].str.lower()
df[str_cols] = df[str_cols].replace({'a': 'z', 'b':'y', 'c': 'x'})
My question is: what is the most pandas-y way (or best practice) to achieve all of the above with method chaining?
I went through the documentation of .assign and .pipe, and many answers here, and have gotten as far as this:
def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
    df[numeric_cols] = df[numeric_cols] * 2
    return df

# same mapping as in the unchained version above
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

df = (df
      .pipe(foo_numbers)
      .assign(str_2=df['str_2'].str.lower())
      .replace({'str_1': to_rep, 'str_2': to_rep})
)
which produces the same output. My problems with this are:
- .pipe seems to just hide the handling of the numeric columns from the main chain, but the implementation inside hasn't improved at all.
- .replace requires me to manually name all the columns one by one. What if I have more than just two columns? (You can assume I want to apply the same replacement to all columns.)
- .assign is OK, but I was hoping there is a way to pass str.lower as a callable to be applied to that one column, but I couldn't make it work.
So what's the correct way to approach these kinds of changes to a DataFrame, using method chaining?
Upvotes: 1
Views: 666
Reputation: 260640
You already have a good approach, except for the fact that you mutate the input. Either make a copy, or chain operations:
def foo_numbers(df):
    df = df.copy()
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0, downcast='infer').mul(2)
    return df
Or:
def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    return (df[numeric_cols].fillna(0, downcast='infer').mul(2)
            .combine_first(df)
           )[df.columns]
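To make the point about mutating the input concrete, here is a minimal sketch comparing the function from the question with a copying variant. The names foo_numbers_inplace, foo_numbers_copy and make_df are hypothetical, introduced only for this illustration, and the copying variant uses .astype('int') as in the question instead of downcast='infer'.

```python
import numpy as np
import pandas as pd

def make_df():
    # Hypothetical helper: rebuilds the example frame from the question.
    return pd.DataFrame({
        'num_1': [np.nan, 2., 2., 3., 1.],
        'num_2': [9., 6., np.nan, 5., 7.],
        'str_1': ['a', 'b', 'c', 'a', 'd'],
        'str_2': ['C', 'B', 'B', 'D', 'A'],
    })

def foo_numbers_inplace(df):
    # As in the question: assigns into the frame it was given.
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
    df[numeric_cols] = df[numeric_cols] * 2
    return df

def foo_numbers_copy(df):
    # Works on a copy, so the caller's frame is left untouched.
    df = df.copy()
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0).astype('int').mul(2)
    return df

original = make_df()
result = original.pipe(foo_numbers_copy)
copy_left_input_alone = original.equals(make_df())

original = make_df()
original.pipe(foo_numbers_inplace)
inplace_changed_input = not original.equals(make_df())
```

With the in-place version, the frame you started the chain from no longer holds the raw data once the chain has run, which is easy to miss when re-running cells in a notebook.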
Here are some more examples.
Using assign:
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
(df.assign(**{c: lambda d, c=c: d[c].fillna(0, downcast='infer').mul(2)
              for c in numeric_cols})
   .assign(**{c: lambda d, c=c: d[c].str.lower().replace({'a': 'z', 'b':'y', 'c': 'x'})
              for c in str_cols})
)
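The c=c default in the lambdas above matters: without it, every lambda looks up c only when assign calls it, after the comprehension has finished, so all of them read the last column. A small sketch on a hypothetical two-column frame shows the difference:

```python
import pandas as pd

# Hypothetical toy frame, only for this illustration.
df = pd.DataFrame({'str_1': ['A', 'B'], 'str_2': ['C', 'D']})
str_cols = ['str_1', 'str_2']

# Late binding: both lambdas see c == 'str_2' when they are finally called.
late = df.assign(**{c: lambda d: d[c].str.lower() for c in str_cols})

# c=c freezes the column name each lambda was created for.
bound = df.assign(**{c: lambda d, c=c: d[c].str.lower() for c in str_cols})
```

In late, str_1 ends up holding the lowercased values of str_2; in bound, each column is transformed from its own values.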
Using apply:
def foo(s):
    if pd.api.types.is_numeric_dtype(s):
        return s.fillna(0, downcast='infer').mul(2)
    elif s.dtype == object:
        return s.str.lower().replace({'a': 'z', 'b':'y', 'c': 'x'})
    return s

df.apply(foo)
Using pipe:
def foo(df):
    df = df.copy()
    df.update(df.select_dtypes('number')
                .fillna(0, downcast='infer').mul(2))
    df.update(df.select_dtypes(object)
                .apply(lambda s: s.str.lower().replace({'a': 'z', 'b':'y', 'c': 'x'}))
             )
    return df

df.pipe(foo)
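On the two smaller points from the question (passing a callable to assign for a single column, and not naming every column in replace), a minimal sketch might look like this. It assumes the same replacement should apply to every column, as the question allows; the name to_rep is taken from the question's chain.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

out = (df
       # The callable receives the frame at this point in the chain.
       .assign(str_2=lambda d: d['str_2'].str.lower())
       # A flat dict is matched against values in every column,
       # so no column names are needed here.
       .replace(to_rep)
)
```

The numeric columns pass through replace unchanged here because none of their values match the string keys.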
Upvotes: 2
Reputation: 28644
One option, with the method chaining:
(df
 .loc(axis=1)[numeric_cols]
 .fillna(0, downcast='infer')
 .mul(2)
 .assign(**df.loc(axis=1)[str_cols]
           .transform(lambda f: f.str.lower())
           .replace({'a':'z', 'b':'y', 'c':'x'}))
)
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
Another option, using pyjanitor's transform_columns:
# pip install pyjanitor
import janitor  # registers transform_columns on DataFrame

(df.transform_columns(numeric_cols,
                      lambda f: f.fillna(0, downcast='infer').mul(2),
                      elementwise=False)
   .transform_columns(str_cols, str.lower)
   .replace({'a':'z', 'b':'y', 'c':'x'})
)
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
Upvotes: 1
Reputation: 37787
I would do it this way with the help of DataFrame.select_dtypes and pandas.concat:
import numpy as np
import pandas as pd

df = (
    pd.concat(
        [df.select_dtypes(np.number)
           .fillna(0)
           .astype(int)
           .mul(2),
         df.select_dtypes('object')
           .apply(lambda s: s.str.lower())
           .replace({'a':'z', 'b':'y', 'c':'x'})], axis=1)
)
Output:
print(df)
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
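One thing to keep in mind with this approach: concat places the numeric block first and the string block second, so if the two kinds of columns are interleaved in the input, the column order changes. A small sketch, on a hypothetical frame called mixed, of restoring the original order by indexing with the input's columns (it selects the string columns with exclude='number' rather than 'object', which is an assumption made to keep the example independent of how strings are stored):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with numeric and string columns interleaved.
mixed = pd.DataFrame({
    'str_1': ['a', 'B'],
    'num_1': [np.nan, 2.],
    'str_2': ['C', 'b'],
})

out = pd.concat(
    [mixed.select_dtypes('number').fillna(0).astype(int).mul(2),
     mixed.select_dtypes(exclude='number').apply(lambda s: s.str.lower())],
    axis=1)

# Reindex with the input's column labels to get the original order back.
restored = out[mixed.columns]
```

For the frame in the question the numeric columns already come first, so the order happens to be preserved there.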
Upvotes: 3