Reputation: 175
I'm running into a strange issue (or is it intended?) where combine_first or update causes values stored as bool to be upcast to float64 if the argument supplied does not include the boolean columns.
Example workflow in ipython:
In [144]: test = pd.DataFrame([[1,2,False,True],[4,5,True,False]], columns=['a','b','isBool', 'isBool2'])
In [145]: test
Out[145]:
   a  b  isBool  isBool2
0  1  2   False     True
1  4  5    True    False
In [147]: b = pd.DataFrame([[45,45]], index=[0], columns=['a','b'])
In [148]: b
Out[148]:
    a   b
0  45  45
In [149]: test.update(b)
In [150]: test
Out[150]:
    a   b  isBool  isBool2
0  45  45       0        1
1   4   5       1        0
Was this meant to be the behavior of the update function? I would think that if nothing was specified, update wouldn't mess with the other columns.
EDIT: I started tinkering around a little more. The plot thickens. If I insert one more command, test.update([]), before running test.update(b), the booleans are preserved, at the cost of the numbers being upcast to object. This also applies to DSM's simplified example.
Based on the pandas source code, it looks like with test.update([]) the reindex_like method creates a DataFrame of dtype object, while reindex_like on b creates a DataFrame of dtype float64. Since object is more general, subsequent operations work with bools. Unfortunately, running np.log on the numerical columns will then fail with an AttributeError.
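If the numeric columns have ended up as object dtype, one possible way to get np.log working again is to convert them back to a numeric dtype first. The sketch below is an assumption about how that could be done and is not part of the original session: obj_col is a made-up stand-in for one of the affected columns, and the exact exception NumPy raises on object data may differ between versions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a numeric column that was upcast to object.
obj_col = pd.Series([45, 4], dtype=object)

# pd.to_numeric infers a proper numeric dtype from the stored values.
numeric_col = pd.to_numeric(obj_col)

# With a numeric dtype, NumPy ufuncs such as np.log apply elementwise.
logs = np.log(numeric_col)
print(numeric_col.dtype, logs.dtype)
```

The same conversion could be applied column by column to the numeric columns of the frame, leaving the boolean columns alone.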
Upvotes: 2
Views: 3993
Reputation: 129018
This is a bug; update shouldn't touch unspecified columns. Fixed here: https://github.com/pydata/pandas/pull/3021
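On a version that predates the fix, one possible workaround is to record the column dtypes before calling update and cast back afterwards. This is only a sketch of that idea, not something taken from the pull request; the name original_dtypes is introduced here for illustration, and the cast is only safe because update leaves no NaN in the existing columns of this example.

```python
import pandas as pd

test = pd.DataFrame([[1, 2, False, True], [4, 5, True, False]],
                    columns=['a', 'b', 'isBool', 'isBool2'])
b = pd.DataFrame([[45, 45]], index=[0], columns=['a', 'b'])

# Remember each column's dtype before the in-place update.
original_dtypes = test.dtypes.to_dict()

test.update(b)

# Cast every column back to the dtype it had before the update.
test = test.astype(original_dtypes)
print(test.dtypes)
```

On versions that include the fix, the final cast should leave the boolean columns unchanged.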
Upvotes: 3
Reputation: 93924
Before updating, the DataFrame b is filled out by reindex_like, so that b becomes
In [5]: b.reindex_like(a)
Out[5]:
     a    b  isBool  isBool2
0   45   45     NaN      NaN
1  NaN  NaN     NaN      NaN
numpy.where is then used to update the DataFrame.
The tragedy is that with numpy.where, if the two inputs have different types, the more general one is used. For example:
In [20]: np.where(True, [True], [0])
Out[20]: array([1])
In [21]: np.where(True, [True], [1.0])
Out[21]: array([ 1.])
Since NaN in numpy is a floating-point value, the result will also be of floating-point type.
In [22]: np.where(True, [True], [np.nan])
Out[22]: array([ 1.])
Therefore, after updating, your 'isBool' and 'isBool2' columns become floating type.
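The two steps can be put together in a small sketch that mimics the mechanism on a single column. It is a simplified illustration, not the actual pandas implementation: the names aligned, this, that, mask and merged are made up here, and the NaN-filled column is explicitly requested as float64, which is an assumption about the dtype that reindex_like produces.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame([[1, 2, False, True], [4, 5, True, False]],
                    columns=['a', 'b', 'isBool', 'isBool2'])
b = pd.DataFrame([[45, 45]], index=[0], columns=['a', 'b'])

# Step 1: align b to the shape of test; cells b does not have become NaN.
aligned = b.reindex_like(test)

# Step 2: pick between the old and the new values with numpy.where.
this = test['isBool'].to_numpy()                      # bool values
that = aligned['isBool'].to_numpy(dtype='float64')    # all NaN (assumed float64)
mask = np.isnan(that)                                 # True where b has no value
merged = np.where(mask, this, that)

# Every value comes from the bool column, yet the result dtype is float64.
print(this.dtype, that.dtype, merged.dtype)
print(merged)
```

Even though no value is actually taken from the NaN side, numpy.where still has to choose one dtype that can hold both inputs, which is why the booleans come out as 0.0 and 1.0.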
I've added this issue to the pandas issue tracker.
Upvotes: 3