user171780

Reputation: 3078

Apply function that operates on Pandas dataframes on a subset of the rows

I have a function that receives a dataframe and returns a new dataframe, which is the same but with some added columns. Just as an example:

def arbitrary_function_that_adds_columns(df):
    # In this trivial example I am adding only 1 column, but this function may add an arbitrary number of columns.
    df['new column'] = df['A'] + df['B'] / 8 + df['A']**3
    return df

Applying this function to a whole dataframe is easy:

import pandas

df = pandas.DataFrame({'A': [1,2,3,4], 'B': [2,3,4,5]})

df = arbitrary_function_that_adds_columns(df)
print(df)

How do I apply arbitrary_function_that_adds_columns to only a subset of the rows? This is what I am trying:

import pandas

df = pandas.DataFrame({'A': [1,2,3,4], 'B': [2,3,4,5]})

rows = df['A'].isin({1,3})
df.loc[rows] = arbitrary_function_that_adds_columns(df.loc[rows])

print(df)

but I get the original dataframe back, without the new column. The result I expect is:

   A  B  new column
0  1  2         NaN
1  2  3      10.375
2  3  4         NaN
3  4  5      68.625

Upvotes: 1

Views: 1015

Answers (2)

Rodalm

Reputation: 5433

Use DataFrame.combine_first

Note that, according to the expected output, you want rows = [1, 3] (the rows with index labels 1 and 3), not rows = df['A'].isin({1,3}). The latter selects the rows whose 'A' value is 1 or 3, which here are the rows labelled 0 and 2.
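To make the difference concrete, here is a quick check on the question's example frame. It is only a sketch; the variable names mask_labels and label_rows are mine, not from the question.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5]})

# Boolean mask: True where the value in column 'A' is 1 or 3
mask = df['A'].isin({1, 3})
mask_labels = df.loc[mask].index.tolist()

# List of index labels: picks the rows labelled 1 and 3
label_rows = df.loc[[1, 3]].index.tolist()

print(mask_labels)  # [0, 2]
print(label_rows)   # [1, 3]
```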

import pandas as pd 

def arbitrary_function_that_adds_columns(df):
    # make sure that the function doesn't mutate the original DataFrame
    # Otherwise, you will get a SettingWithCopyWarning 
    df = df.copy()

    df['new column'] = df['A'] + df['B'] / 8 + df['A']**3
    return df

df = pd.DataFrame({'A': [1,2,3,4], 'B': [2,3,4,5]})

rows = [1, 3]
# the function is applied to a copy of a DataFrame slice 
>>> sub_df = arbitrary_function_that_adds_columns(df.loc[rows])
>>> sub_df

   A  B  new column
1  2  3      10.375
3  4  5      68.625

# Add the new information to the original df 
>>> df = df.combine_first(sub_df)
>>> df

   A  B  new column
0  1  2         NaN
1  2  3      10.375
2  3  4         NaN
3  4  5      68.625
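If you need this for several functions, the same pattern can be wrapped in a small helper. This is a minimal sketch, not a pandas feature: apply_to_rows and add_columns are hypothetical names, and add_columns just stands in for the question's function.

```python
import pandas as pd

def add_columns(df):
    # Stand-in for the question's function; works on a copy of its input
    out = df.copy()
    out['new column'] = out['A'] + out['B'] / 8 + out['A'] ** 3
    return out

def apply_to_rows(df, rows, func):
    """Hypothetical helper: run func on df.loc[rows] and merge the result back."""
    sub_df = func(df.loc[rows])
    # Values already present in df take priority; sub_df only fills
    # missing values and contributes its new columns
    return df.combine_first(sub_df)

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5]})
result = apply_to_rows(df, [1, 3], add_columns)
print(result)
```

Keep in mind that combine_first gives priority to the values of the calling DataFrame, so this only picks up columns that the function adds. If the function also changed existing columns, those changes would not be carried over.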

Here is another way that writes the new column directly into the original DataFrame, without an explicit copy() of the subset inside the function.

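def arbitrary_function_that_adds_columns(df, rows='all'):
    # Compare only when rows is a string: `rows == 'all'` on a boolean mask
    # or an array is evaluated elementwise and cannot be used in an `if`
    if isinstance(rows, str) and rows == 'all':
        rows = df.index
    sub_df = df.loc[rows]

    # Write only into the selected rows; the other rows get NaN.
    # Note that this modifies the DataFrame that was passed in.
    df.loc[rows, 'new column'] = sub_df['A'] + sub_df['B'] / 8 + sub_df['A']**3

    return df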

>>> df = arbitrary_function_that_adds_columns(df, rows)
>>> df 

   A  B  new column
0  1  2         NaN
1  2  3      10.375
2  3  4         NaN
3  4  5      68.625

Upvotes: 2

Joshua Voskamp

Reputation: 2044

With the example you've given:

df['A+B'] = df.loc[df['A'].isin({1,3})].sum(axis=1)

or

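import numpy as np

df['A+B'] = np.nan
# sum_AB is not defined in this answer; presumably it is a function
# that returns the sums for the selected rows
df.loc[df['A'].isin({1,3}),['A+B']] = sum_AB(df)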

More generally:

df.loc[ [row mask], [column mask] ] = [returned df of same shape]

#optionally, use fillna/bfill/ffill as appropriate

For more complicated cases, take a look at DataFrame.transform and DataFrame.apply; combining them with df.loc and an appropriate boolean mask should accomplish what you need.
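As a concrete illustration of the mask-plus-loc pattern, here is a minimal sketch on the question's example frame. It assumes the columns are first created as NaN, and the lambda simply repeats the question's formula row by row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 3, 4, 5]})
mask = df['A'].isin({1, 3})

# Create the target columns first so unselected rows stay NaN
df['A+B'] = np.nan
df['new column'] = np.nan

# Vectorised: assign a computed Series to the masked rows only
df.loc[mask, 'A+B'] = df.loc[mask, ['A', 'B']].sum(axis=1)

# Row-wise alternative: DataFrame.apply on the masked rows only
df.loc[mask, 'new column'] = df.loc[mask].apply(
    lambda row: row['A'] + row['B'] / 8 + row['A'] ** 3, axis=1
)
print(df)
```

The vectorised form is usually preferable; apply with axis=1 is mainly useful when the computation cannot easily be expressed with column operations.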

Upvotes: 0
