Reservedegotist

Reputation: 175

pandas DataFrame combine_first and update methods have strange behavior

I'm running into a strange issue (or is it intended?) where combine_first or update causes values stored as bool to be upcast to float64 when the argument passed in does not include the boolean columns.

Example workflow in ipython:

In [144]: test = pd.DataFrame([[1,2,False,True],[4,5,True,False]], columns=['a','b','isBool', 'isBool2'])

In [145]: test
Out[145]:
   a  b isBool isBool2
0  1  2  False    True
1  4  5   True   False


In [147]: b = pd.DataFrame([[45,45]], index=[0], columns=['a','b'])

In [148]: b
Out[148]:
    a   b
0  45  45

In [149]: test.update(b)

In [150]: test
Out[150]:
    a   b  isBool  isBool2
0  45  45       0        1
1   4   5       1        0

Is this the intended behavior of update? I would expect update to leave alone any columns that the argument does not specify.
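As a stopgap on a pandas version that shows this upcast, one option is to record the dtypes before calling update and cast back any column that the argument did not supply. This is only a sketch: update_preserving_dtypes is a hypothetical helper name, not part of pandas, and on versions where update already leaves those columns alone it simply does nothing extra.

```python
import pandas as pd


def update_preserving_dtypes(target, other):
    """Hypothetical helper: run target.update(other), then restore the
    original dtype of every column that `other` did not supply."""
    original = target.dtypes.to_dict()
    target.update(other)
    for col, dtype in original.items():
        # Only touch columns absent from `other` whose dtype changed.
        if col not in other.columns and target[col].dtype != dtype:
            target[col] = target[col].astype(dtype)


test = pd.DataFrame([[1, 2, False, True], [4, 5, True, False]],
                    columns=['a', 'b', 'isBool', 'isBool2'])
b = pd.DataFrame([[45, 45]], index=[0], columns=['a', 'b'])

update_preserving_dtypes(test, b)
print(test.dtypes)
```

Casting back is safe here because the untouched boolean columns never actually receive a NaN; only their storage type changed.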


EDIT: I tinkered a little more, and the plot thickens. If I run one more command, test.update([]), before test.update(b), the boolean columns keep their values, at the cost of the numeric columns being upcast to object. This also applies to DSM's simplified example.

Based on the pandas source code, it looks like reindex_like on [] creates a DataFrame of dtype object, while reindex_like on b creates a DataFrame of dtype float64. Since object is more general, subsequent operations preserve the bools. Unfortunately, running np.log on the numeric columns then fails with an AttributeError.
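The np.log failure on object data can be reproduced with plain NumPy, without pandas. The sketch below assumes an object array holding Python ints, which is roughly what a numeric column looks like after being upcast to object. Depending on the NumPy version the error is an AttributeError or a TypeError, so both are caught; casting back to float64 first avoids it.

```python
import numpy as np

# Numbers stored with dtype=object, as after the upcast described above.
values = np.array([45, 4], dtype=object)

try:
    np.log(values)
    failed = False
except (AttributeError, TypeError):
    # NumPy looks for a .log() method on each Python int and finds none.
    failed = True

# Casting to a real numeric dtype makes the ufunc work again.
logs = np.log(values.astype(np.float64))
print(failed, logs)
```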

Upvotes: 2

Views: 3993

Answers (2)

Jeff

Reputation: 129018

This is a bug: update shouldn't touch unspecified columns. Fixed here: https://github.com/pydata/pandas/pull/3021

Upvotes: 3

waitingkuo

Reputation: 93924

Before updating, the DataFrame b is reindexed with reindex_like to match the target frame (called a here; it is test in the question), so that b becomes

In [5]: b.reindex_like(a)
Out[5]: 
    a   b  isBool  isBool2
0  45  45     NaN      NaN
1 NaN NaN     NaN      NaN

pandas then uses numpy.where to update the DataFrame.
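The reindexing step can be inspected on its own using the frames from the question. This is a small sketch that only calls reindex_like and looks at the result; it does not reproduce the internals of update.

```python
import pandas as pd

test = pd.DataFrame([[1, 2, False, True], [4, 5, True, False]],
                    columns=['a', 'b', 'isBool', 'isBool2'])
b = pd.DataFrame([[45, 45]], index=[0], columns=['a', 'b'])

# Give b the same index and columns as test; missing cells become NaN.
filled = b.reindex_like(test)

print(filled)
print(filled.dtypes)
```

The columns that b never had are entirely NaN, and NaN forces a floating-point dtype, which is where the float64 enters the picture.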

The trouble is that when the two inputs to numpy.where have different types, the result uses the more general one. For example:

In [20]: np.where(True, [True], [0])
Out[20]: array([1])

In [21]: np.where(True, [True], [1.0])
Out[21]: array([ 1.])

Since NaN in numpy is a floating-point value, the result is also floating point.

In [22]: np.where(True, [True], [np.nan])
Out[22]: array([ 1.])

Therefore, after updating, your 'isBool' and 'isBool2' columns become floating type.
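The promotion rule can be checked without pandas by asking NumPy for the common dtype of the two inputs and comparing it with what numpy.where returns. A minimal sketch with the same toy values as above:

```python
import numpy as np

bools = np.array([True])
ints = np.array([0])
nans = np.array([np.nan])

# np.where returns the common (promoted) dtype of its two value inputs.
with_ints = np.where(True, bools, ints)
with_nans = np.where(True, bools, nans)

print(with_ints.dtype, np.result_type(bools, ints))
print(with_nans.dtype, np.result_type(bools, nans))

# When both inputs are bool, no promotion happens.
print(np.where(True, bools, np.array([False])).dtype)
```

So the only way to keep a bool result from numpy.where is to pass bool on both sides, which NaN-filled data cannot be.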

I've added this issue to the pandas issue tracker.

Upvotes: 3
