Jonah

Reputation: 11

Pandas - modifying single/multiple columns with method chaining

I discovered method chaining in pandas only very recently. I love how it makes the code cleaner and more readable, but I still can't figure out how to use it when I want to modify only a single column, or a group of columns, as part of the pipeline.

For example, let's say this is my DataFrame:

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})

And I have some manipulation I want to do on it:

numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
df[numeric_cols] = df[numeric_cols] * 2
df['str_2'] = df['str_2'].str.lower()
df[str_cols] = df[str_cols].replace({'a': 'z', 'b':'y', 'c': 'x'})

My question is - what is the most pandas-y way / best practice to achieve all of the above with method chaining?


I went through the documentation of .assign and .pipe, and many answers here, and have gotten as far as this:

def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
    df[numeric_cols] = df[numeric_cols] * 2
    return df

to_rep = {'a': 'z', 'b':'y', 'c': 'x'}

df = (df
      .pipe    (foo_numbers)
      .assign  (str_2=df['str_2'].str.lower())
      .replace ({'str_1':to_rep, 'str_2':to_rep})
     )

which produces the same output. My problems with this are:

So what's the correct way to approach these kinds of changes to a DataFrame, using method chaining?

Upvotes: 1

Views: 666

Answers (3)

mozway

Reputation: 260640

You already have a good approach, except for the fact that you mutate the input. Either make a copy, or chain operations:

def foo_numbers(df):
    df = df.copy()
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0, downcast='infer').mul(2)
    return df

Or:

def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    return (df[numeric_cols].fillna(0, downcast='infer').mul(2)
            .combine_first(df)
            )[df.columns]
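
To make the mutation point concrete, here is a minimal sketch comparing a function that assigns into its argument with one that copies first. The names make_df, mutating and non_mutating are made up for this illustration, and plain fillna(0) is used instead of the downcast variant to keep it short.

```python
import numpy as np
import pandas as pd

def make_df():
    # Fresh frame each time so the two experiments do not interfere.
    return pd.DataFrame({
        'num_1': [np.nan, 2., 2., 3., 1.],
        'num_2': [9., 6., np.nan, 5., 7.],
    })

def mutating(df):
    # Assigns into the frame it was given, so the caller's object changes too.
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0).mul(2)
    return df

def non_mutating(df):
    # Works on a copy, so the caller's frame is left alone.
    df = df.copy()
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0).mul(2)
    return df

original = make_df()
original.pipe(mutating)
changed_by_mutating = not original.equals(make_df())

original = make_df()
original.pipe(non_mutating)
changed_by_copying = not original.equals(make_df())

print(changed_by_mutating, changed_by_copying)
```

The first flag comes out True and the second False: piping through the mutating version alters the frame the chain started from, even if the chain's result is never assigned back.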

Here are some more examples.

Using assign:

numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']

(df.assign(**{c: lambda d, c=c: d[c].fillna(0, downcast='infer').mul(2)
              for c in numeric_cols})
   .assign(**{c: lambda d, c=c: d[c].str.lower().replace({'a': 'z', 'b':'y', 'c': 'x'})
              for c in str_cols})
 )
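
A note on the c=c default argument in the lambdas above: it is there because Python closures look up loop variables when they are called, not when they are defined. A small sketch on a toy two-column frame (not the question's data) shows the difference:

```python
import pandas as pd

df = pd.DataFrame({'str_1': ['A', 'B'], 'str_2': ['C', 'D']})
str_cols = ['str_1', 'str_2']

# Without c=c, both lambdas look up c when assign calls them, after the
# comprehension has finished, so both columns are built from 'str_2'.
late = df.assign(**{c: lambda d: d[c].str.lower() for c in str_cols})

# With c=c, each lambda keeps the column name it was created for.
bound = df.assign(**{c: lambda d, c=c: d[c].str.lower() for c in str_cols})

print(late)
print(bound)
```

In late, str_1 ends up holding the lowercased values of str_2; in bound, each column is transformed from its own values.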

Using apply:

def foo(s):
    if pd.api.types.is_numeric_dtype(s):
        return s.fillna(0, downcast='infer').mul(2)
    elif s.dtype == object:
        return s.str.lower().replace({'a': 'z', 'b':'y', 'c': 'x'})
    return s

df.apply(foo)

Using pipe:

def foo(df):
    df = df.copy()
    df.update(df.select_dtypes('number')
              .fillna(0, downcast='infer').mul(2))
    df.update(df.select_dtypes(object)
              .apply(lambda s: s.str.lower().replace({'a': 'z', 'b':'y', 'c': 'x'}))
             )
    return df

df.pipe(foo)

Upvotes: 2

sammywemmy

Reputation: 28644

One option, using method chaining:

(df
.loc(axis=1)[numeric_cols]
.fillna(0,downcast='infer')
.mul(2)
.assign(**df.loc(axis=1)[str_cols]
            .transform(lambda f: f.str.lower())
            .replace({'a':'z', 'b':'y','c':'x'}))
)

Output:

   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z

Another option, using pyjanitor's transform_columns:

(df.transform_columns(numeric_cols, 
                      lambda f: f.fillna(0,downcast='infer').mul(2), 
                      elementwise=False)
.transform_columns(str_cols, str.lower)
.replace({'a':'z', 'b':'y','c':'x'})
)

Output:

   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
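
If adding pyjanitor as a dependency is not an option, a similar effect can be sketched in plain pandas with a small helper passed to pipe. The name modify_columns is hypothetical (it is not a pandas or pyjanitor function), and the helper assumes string column names, since it unpacks the transformed sub-frame into assign.

```python
import numpy as np
import pandas as pd

def modify_columns(df, cols, func):
    # Hypothetical helper: apply func to the sub-frame df[cols] and return a
    # new frame with those columns replaced. The input frame is not modified.
    return df.assign(**func(df[cols]))

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']

result = (
    df
    .pipe(modify_columns, numeric_cols,
          lambda d: d.fillna(0).astype(int).mul(2))
    .pipe(modify_columns, ['str_2'],
          lambda d: d.apply(lambda s: s.str.lower()))
    .pipe(modify_columns, str_cols,
          lambda d: d.replace({'a': 'z', 'b': 'y', 'c': 'x'}))
)
print(result)
```

Each step reads the intermediate frame produced by the previous step, so the chain mirrors the four statements in the question one by one.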

Upvotes: 1

Timeless

Reputation: 37787

I would do it this way with the help of pandas.DataFrame.select_dtypes and pandas.concat:

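import numpy as np
import pandas as pd

df = pd.concat(
    [
        # numeric columns: fill NaN, cast to int, then double
        df.select_dtypes(np.number).fillna(0).astype(int).mul(2),
        # string columns: lowercase, then map the letters
        df.select_dtypes('object')
          .apply(lambda s: s.str.lower())
          .replace({'a': 'z', 'b': 'y', 'c': 'x'}),
    ],
    axis=1,
)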

Output:

print(df)

   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
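
One thing to watch with the concat approach: the result lists all numeric columns first and the remaining columns after them, so the original column order is only preserved when the frame is already laid out that way, as in the question. A minimal sketch on a toy frame with interleaved columns (not the question's data) shows the difference and one way to restore the order by indexing with df.columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'str_1': ['a', 'B'],
    'num_1': [np.nan, 2.],
    'str_2': ['C', 'd'],
})

# Numeric block first, everything else second.
unordered = pd.concat(
    [df.select_dtypes('number').fillna(0).astype(int).mul(2),
     df.select_dtypes(exclude='number').apply(lambda s: s.str.lower())],
    axis=1,
)

# Indexing with the original column labels puts the columns back in order.
out = unordered[df.columns]

print(list(unordered.columns))
print(list(out.columns))
```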

Upvotes: 3
