Reputation: 117

Alternatives to nested numpy.where for multiconditional pandas operations?

I have a Pandas DataFrame with conditional column A and numeric column B.

    A    B
1 'foo' 1.2
2 'bar' 1.3
3 'foo' 2.2

I also have a Python dictionary that defines ranges of B which denote "success" given each value of A.

mydict = {'foo': [1, 2], 'bar': [2, 3]}

I want to make a new column, 'error', in the dataframe. It should describe how far outside the acceptable bounds for its value of A each value of B falls. If B is within the range, the value should be zero.

    A    B   error
1 'foo' 1.2   0
2 'bar' 1.3  -0.7
3 'foo' 2.2   0.2

I'm not a complete Pandas/Numpy newbie, and I'm halfway decent at Python, but this proved somewhat difficult. I don't want to do it with iterrows(), since I understand that's computationally expensive and this is going to get called a lot.

I eventually figured out a solution by combining lambda functions, pandas.Series.map(), and nested numpy.where()s with given values for the optional x and y inputs.

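# Look up the lower and upper bound for each value of A.
getmin = lambda x: mydict[x][0]
getmax = lambda x: mydict[x][1]
# Negative error below the range, positive above it, zero inside it.
df['error'] = np.where(df.B < df.A.map(getmin),
                       df.B - df.A.map(getmin),
                       np.where(df.B > df.A.map(getmax),
                                df.B - df.A.map(getmax),
                                0
                                )
                       )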

It works, but this can't possibly be the best way to do this, right? I feel like I'm abusing numpy.where() to get around not knowing how to map values from multiple columns of a dataframe to a lambda function in a non-iterative way. (It also lets me avoid writing mildly gnarly lambda functions.)

Kind of three questions, I guess.

1) Is it OK to nest numpy.where()s for triconditional array operations?
2) How can I non-iteratively map from two dataframe columns to one function?
3) If 2) is possible and 1) is acceptable, which is preferable?

Upvotes: 6

Answers (2)

maxymoo

Reputation: 36545

For your question about how to map multiple columns, you do it with

DataFrame.apply(func, axis=1)
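As a rough sketch of what that looks like for the data in the question: with axis=1, the function receives one row at a time and can read both columns. The helper name row_error is made up for illustration, and the frame is rebuilt here only so the snippet runs on its own. Note that this still calls a Python function once per row, so it is unlikely to be faster than the vectorized versions.

```python
import pandas as pd

# Sample data and bounds copied from the question.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'B': [1.2, 1.3, 2.2]},
                  index=[1, 2, 3])
mydict = {'foo': [1, 2], 'bar': [2, 3]}

def row_error(row):
    # Hypothetical helper: 'row' is a Series holding one row of df.
    low, high = mydict[row['A']]
    if row['B'] < low:
        return row['B'] - low    # below the range: negative error
    if row['B'] > high:
        return row['B'] - high   # above the range: positive error
    return 0.0                   # inside the range

# axis=1 passes each row (rather than each column) to the function.
row_errors = df.apply(row_error, axis=1)
print(row_errors)
```

The result is a Series aligned to df's index, with values of approximately 0.0, -0.7 and 0.2.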

For your specific problem I don't think you need this, though; I think it's clearer if you do your calculation in a few steps:

df['low'] = df.A.map(lambda x: mydict[x][0])
df['high'] = df.A.map(lambda x: mydict[x][1])
df['error'] = np.maximum(df.B - df.high, 0) + np.minimum(df.B - df.low, 0)
df
     A    B  low  high  error
1  foo  1.2    1     2    0.0
2  bar  1.3    2     3   -0.7
3  foo  2.2    1     2    0.2
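A possible variation on the same idea, not part of the answer above: the error is the difference between B and B clipped into [low, high], so numpy.clip can stand in for the maximum/minimum pair. This is only a sketch; the frame is rebuilt so the snippet is self-contained, and the bounds are kept in local variables instead of new columns.

```python
import numpy as np
import pandas as pd

# Sample data and bounds copied from the question.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'B': [1.2, 1.3, 2.2]},
                  index=[1, 2, 3])
mydict = {'foo': [1, 2], 'bar': [2, 3]}

# Per-row lower and upper bounds, looked up from A.
low = df['A'].map(lambda k: mydict[k][0])
high = df['A'].map(lambda k: mydict[k][1])

# Clip B into its allowed range; the distance B moved is the error.
clipped = np.clip(df['B'].to_numpy(), low.to_numpy(), high.to_numpy())
error = df['B'] - clipped
print(error)
```

Values inside the range are unchanged by the clip, so their error is exactly zero; values outside keep the sign of their distance to the nearest bound.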

Upvotes: 6

Alexander

Reputation: 109536

I believe the code below is arguably more readable.

df['min'] = df.A.apply(lambda x: min(mydict[x]))
df['max'] = df.A.apply(lambda x: max(mydict[x]))
df['error'] = 0.
df.loc[df.B.gt(df['max']), 'error'] = df.B - df['max']
df.loc[df.B.lt(df['min']), 'error'] = df.B - df['min']
df.drop(['min', 'max'], axis=1, inplace=True)
>>> df
     A    B  error
1  foo  1.2    0.0
2  bar  1.3   -0.7
3  foo  2.2    0.2

I don't see why you couldn't use numpy.where() for triconditional array operations, but you do sacrifice simplicity.
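If the nesting itself is the concern, numpy.select is one alternative worth considering (it is not used in either answer here, so treat this as a sketch). It takes a flat list of conditions and a matching list of choices, plus a default for rows where no condition holds. The frame is rebuilt so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data and bounds copied from the question.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'B': [1.2, 1.3, 2.2]},
                  index=[1, 2, 3])
mydict = {'foo': [1, 2], 'bar': [2, 3]}

# Per-row lower and upper bounds, looked up from A.
low = df['A'].map(lambda k: mydict[k][0])
high = df['A'].map(lambda k: mydict[k][1])

# Conditions are checked in order; the first one that is True wins.
conditions = [df['B'] < low, df['B'] > high]
choices = [df['B'] - low, df['B'] - high]

# Rows matching neither condition are inside the range and get 0.0.
df['error'] = np.select(conditions, choices, default=0.0)
print(df)
```

Adding a fourth or fifth case then means appending to the two lists, not adding another level of nesting.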

Upvotes: 1
