pandas DataFrame combine_first and update methods have strange behavior

Question

I'm running into a strange issue (or is it intended?) where combine_first or update causes values stored as bool to be upcast to float64 if the argument supplied does not include the boolean columns.

Example workflow in ipython:

In [144]: test = pd.DataFrame([[1,2,False,True],[4,5,True,False]], columns=['a','b','isBool', 'isBool2'])

In [145]: test
Out[145]:
   a  b isBool isBool2
0  1  2  False    True
1  4  5   True   False


In [147]: b = pd.DataFrame([[45,45]], index=[0], columns=['a','b'])

In [148]: b
Out[148]:
    a   b
0  45  45

In [149]: test.update(b)

In [150]: test
Out[150]:
    a   b  isBool  isBool2
0  45  45       0        1
1   4   5       1        0

Was this meant to be the behavior of the update function? I would think that update wouldn't touch columns that the argument doesn't specify.

EDIT: I started tinkering around a little more, and the plot thickens. If I insert one more command, test.update([]), before running test.update(b), the booleans are preserved, at the cost of the numbers being upcast to object. This also applies to DSM's simplified example.

Based on the pandas source code, it looks like reindex_like with the empty list creates a DataFrame of dtype object, while reindex_like with b creates a DataFrame of dtype float64. Since object is more general, subsequent operations work with bools. Unfortunately, running np.log on the numerical columns then fails with an AttributeError.
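The float64 half of that observation can be reproduced outside of update. The following is a minimal sketch, assuming a reasonably recent pandas; it calls reindex_like directly on the frames from the example and makes no claim about what update does internally in any particular version. Aligning b to the shape of test introduces missing cells, and NaN cannot be stored in an int64 column, so every column of the aligned frame comes out as float64.

```python
import pandas as pd

# Frames from the question.
test = pd.DataFrame(
    [[1, 2, False, True], [4, 5, True, False]],
    columns=["a", "b", "isBool", "isBool2"],
)
b = pd.DataFrame([[45, 45]], index=[0], columns=["a", "b"])

# Align b to the index and columns of test. Cells that b does not
# cover are filled with NaN, which forces a floating-point dtype.
aligned = b.reindex_like(test)

print(aligned)
print(aligned.dtypes)
```

If such an all-float frame is then combined with test column by column, including the columns b never mentioned, the boolean columns would be upcast along with it, which is consistent with the output shown in the question.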

Jeff · Accepted Answer

This is a bug: update shouldn't touch unspecified columns. Fixed here: https://github.com/pydata/pandas/pull/3021
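To check whether an installed pandas includes the fix, the example from the question can be rerun with a look at the dtypes afterwards. This is only a sketch: the second half is a hypothetical workaround for older releases (not part of the linked pull request) that writes just the overlapping cells with .loc. Note that, unlike update, this workaround would also copy NaN values from b.

```python
import pandas as pd

test = pd.DataFrame(
    [[1, 2, False, True], [4, 5, True, False]],
    columns=["a", "b", "isBool", "isBool2"],
)
b = pd.DataFrame([[45, 45]], index=[0], columns=["a", "b"])

# With the fix in place, update() should leave the columns that are
# absent from b (isBool, isBool2) untouched.
updated = test.copy()
updated.update(b)
print(updated.dtypes)

# Hypothetical workaround for releases without the fix: assign only
# the cells that b actually covers, so other columns are never touched.
patched = test.copy()
patched.loc[b.index, b.columns] = b.to_numpy()
print(patched.dtypes)
```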
