jwilner

Reputation: 6606

pandas dataframe.where misbehaving

I'm trying to implement a function that returns the element-wise maximum of two DataFrames or Series, keeping NaN out of the result wherever possible.

In [217]: a
Out[217]: 
   0  1
0  4  1
1  6  0

[2 rows x 2 columns]

In [218]: b
Out[218]: 
    0   1
0 NaN   3
1   3 NaN

[2 rows x 2 columns]


In [219]: do_not_replace = b.isnull() | (a > b)

In [220]: do_not_replace
Out[220]: 
      0      1
0  True  False
1  True   True

[2 rows x 2 columns]


In [221]: a.where(do_not_replace, b)
Out[221]: 
   0  1
0  4  3
1  1  0

[2 rows x 2 columns]


In [222]: expected
Out[222]: 
   0  1
0  4  3
1  6  0

[2 rows x 2 columns]

In [223]: pd.__version__
Out[223]: '0.13.1'

I imagine there are other ways to implement this function, but I can't explain this behavior. Where is that 1 in row 1, column 0 of Out[221] coming from? I think the logic is sound. Am I misinterpreting how the function works?

Upvotes: 2

Views: 378

Answers (2)

Jeff

Reputation: 128918

This is essentially what where does internally. I think this might be a transposition bug. The bug has since been fixed. It turns out that a symmetric DataFrame and a passed frame were both required to reproduce it, which made it very subtle. Note that this other form of indexing (below) uses a different, in-place method, so it was not affected.

In [56]: a[~do_not_replace] = b

In [57]: a
Out[57]: 
   0  1
0  4  3
1  6  0

Note: this has been fixed in master/0.14.1.
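
For anyone checking their own install, here is a minimal sketch that rebuilds the frames from the question and compares the where result against the expected values. The frames are reconstructed from the printed output above, so the dtypes are an assumption, and the comparison is on values only. On a release that includes the fix this should print True; on an affected release it should print False.

```python
import numpy as np
import pandas as pd

# Frames reconstructed from the question's printed output (dtypes assumed)
a = pd.DataFrame([[4, 1], [6, 0]])
b = pd.DataFrame([[np.nan, 3], [3, np.nan]])

# Keep a's value where b is NaN or a is larger; otherwise take b's value
do_not_replace = b.isnull() | (a > b)
result = a.where(do_not_replace, b)

# Compare values only, ignoring dtype differences between versions
print(result.values.tolist() == [[4, 3], [6, 0]])
```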

Upvotes: 5

Dan Lenski

Reputation: 79732

I can't reproduce this problem with "plain" numpy arrays:

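import numpy as np

a = np.array([(4, 1), (6, 0)])
b = np.array([(np.nan, 3), (3, np.nan)])

print(a)
print(b)

# True where a's value should be kept: b is NaN there, or a is larger
do_not_replace = np.isnan(b) | (a > b)
print(do_not_replace)

# Take a where the mask is True, otherwise take b
print(np.where(do_not_replace, a, b))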

... gives what you want, I think:

array([[ 4.,  3.],
       [ 6.,  0.]])
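
If the goal is just the element-wise maximum that skips NaN where it can, numpy's fmax might be a shorter route than building the mask by hand. This is only a sketch on the same two arrays as above, not something from the original question. fmax returns the non-NaN value when exactly one input is NaN, and NaN only when both are.

```python
import numpy as np

a = np.array([(4, 1), (6, 0)])
b = np.array([(np.nan, 3), (3, np.nan)])

# Element-wise maximum; a NaN on one side is ignored in favour of the other value
result = np.fmax(a, b)
print(result)
```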

@jwilner: As @Jeff suggests, it could be a pandas bug. What version are you running?

Upvotes: 1
