Cleb

Reputation: 25997

How to manipulate column entries using only one specific output of a function that returns several values?

I have a dataframe like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(4), 'b': range(2, 6)})

   a  b
0  0  2
1  1  3
2  2  4
3  3  5

and I have a function that returns several values. Here I just use a dummy function that returns the minimum and maximum for a certain input iterable:

def return_min_max(x):
    return (np.min(x), np.max(x))

Now I want to e.g. add the maximum of each column to each value in the respective column.

So

df.apply(return_min_max)

gives

a    (0, 3)
b    (2, 5)

and then

df.add(df.apply(return_min_max).apply(lambda x: x[1]))

yields the desired outcome

   a   b
0  3   7
1  4   8
2  5   9
3  6  10

I am wondering whether there is a more straightforward way that avoids the two chained apply calls.

Just to make sure:

I am NOT interested in a

df.add(df.max())

type solution. I emphasized that return_min_max is a dummy function to make clear that this is not my actual function; it just serves as a minimal example of a function that has several outputs.

Upvotes: 2

Views: 47

Answers (2)

Quang Hoang

Reputation: 150735

On second look, your return_min_max operates on a whole column, so this is not that bad. You can do, e.g.:

# create a dataframe for easy access
ret_df = pd.DataFrame(df.apply(return_min_max).to_dict())
#    a  b
# 0  0  2
# 1  3  5

# add 
df.add(ret_df.loc[1], axis=1)

Output:

   a   b
0  3   7
1  4   8
2  5   9
3  6  10
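
The shape of what `df.apply` returns for a tuple-valued function may depend on the pandas version (a Series of tuples, as in the question, or an already expanded frame), so here is a rough sketch that sidesteps `apply` by calling the function once per column. The `'min'`/`'max'` row labels are my own addition for readability; they are not part of the original snippet.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(4), 'b': range(2, 6)})

def return_min_max(x):
    return (np.min(x), np.max(x))

# Call the function once per column; each returned tuple becomes
# one column of the lookup frame, one row per output.
ret_df = pd.DataFrame(
    {col: return_min_max(df[col]) for col in df.columns},
    index=['min', 'max'],  # labels chosen here for illustration
)

# Select the row holding the output of interest and add it column-wise.
result = df.add(ret_df.loc['max'], axis=1)
print(result)
```

The function is still evaluated only once per column, and any of its outputs can be picked from `ret_df` afterwards.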

And with NumPy broadcasting, which adds every output of the function (here min and max) at once:

df.values[None,:] + ret_df.values[:,None]

gives:

array([[[ 0,  4],
        [ 1,  5],
        [ 2,  6],
        [ 3,  7]],

       [[ 3,  7],
        [ 4,  8],
        [ 5,  9],
        [ 6, 10]]], dtype=int64)
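
To unpack the broadcast a little: the result has one 4x2 block per function output, so block 0 is `df` plus the minima and block 1 is `df` plus the maxima. A minimal sketch, assuming the outputs are collected into a plain array first (the names `ret`, `stacked` and `plus_max` are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(4), 'b': range(2, 6)})

def return_min_max(x):
    return (np.min(x), np.max(x))

# Rows: outputs (min, max); columns: the columns of df.
ret = np.array([return_min_max(df[col]) for col in df.columns]).T

# (1, 4, 2) + (2, 1, 2) broadcasts to (2, 4, 2):
# one copy of df per output, each shifted by that output.
stacked = df.values[None, :] + ret[:, None]
print(stacked.shape)

# Wrap the block for the maximum back into a labelled DataFrame.
plus_max = pd.DataFrame(stacked[1], index=df.index, columns=df.columns)
print(plus_max)
```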

Upvotes: 3

ALollz

Reputation: 59519

DataFrame.max returns a Series of the column-wise maximum values. DataFrame.add() will then add this Series, aligning on columns.

df.add(df.max())

#   a   b
#0  3   7
#1  4   8
#2  5   9
#3  6  10

If your real function is much more complicated, there are a few alternatives.

Keep it as is and use .str, which indexes into each tuple element-wise, to access the max element.

def return_min_max(x):
    return (np.min(x), np.max(x))

df.add(df.apply(return_min_max).str[1])

Consider returning a Series with the index being descriptive about what is returned:

def return_min_max(x):
    return pd.Series([np.min(x), np.max(x)], index=['min', 'max'])

df.add(df.apply(return_min_max).loc['max'])
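
As a rough illustration of why the labelled Series helps: a single `apply` call yields a frame whose rows are the named outputs, so each one can be reused without re-running the function. The variable names `stats`, `plus_max` and `minus_min` below are only for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(4), 'b': range(2, 6)})

def return_min_max(x):
    # The index labels each returned value.
    return pd.Series([np.min(x), np.max(x)], index=['min', 'max'])

# One apply call; rows are the labelled outputs, columns match df.
stats = df.apply(return_min_max)
print(stats)

# Each output can now be selected by name and aligned on columns.
plus_max = df.add(stats.loc['max'])
minus_min = df.sub(stats.loc['min'])
print(plus_max)
print(minus_min)
```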

Or if the returns can be separated (in this case max and min really don't need to be done in the same function), it's simpler to have them separated:

def return_max(x):
    return np.max(x)

df.add(df.apply(return_max))

Upvotes: 2
