Reputation: 429
I want to sum Pandas Series objects, but the results I get do not seem to match what the documentation says.
In Pandas 0.19.2, the following code:
import pandas as pd
a = pd.Series({1: 2, 3: 4})
b = pd.Series({3: 5, 4: 6})
print(a + b)
gives me,
1 NaN
3 9.0
4 NaN
dtype: float64
however, the documentation says:
When summing data, NA (missing) values will be treated as zero
This seems to treat them as NaN rather than zeros. I was expecting the output:
1 2.0
3 9.0
4 6.0
dtype: float64
In my case the Series come from value_counts() over several columns and I wanted to use sum(), but it gives me NaN for all rows that don't have values in all columns, which is wrong. There should be an integer for every row.
Another mystery for me is why the result has dtype float:
a.dtype, b.dtype, (a+b).dtype
gives,
(dtype('int64'), dtype('int64'), dtype('float64'))
which is quite surprising to me.
Edit: if I make sure that a and b have the same rows, then the resulting dtype is int64. So the change to float is clearly just to allow for the NaN value, which is a bit shocking.
Edit 2: Fix mistake in the expected output.
Upvotes: 3
Views: 15103
Reputation: 95948
The claim from the documentation refers to reducing sums, i.e.:
>>> a + b
1 NaN
3 9.0
4 NaN
dtype: float64
>>> (a + b).sum()
9.0 # nans treated as zero...
This does not apply to vectorized (element-wise) sums. For those, you'll have to handle the NaNs explicitly:
>>> (a + b).fillna(0)
1 0.0
3 9.0
4 0.0
dtype: float64
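As for the promotion to float, that is a common pandas gotcha: NaN is a floating-point value, so an integer Series that gains a NaN is upcast to float64. You can read about it in the pandas documentation.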
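To make the promotion visible, here is a minimal sketch, assuming a reasonably recent pandas. The explicit reindex step is only an illustration of the index alignment that happens before the addition; it is not meant as a description of what pandas runs internally.

```python
import pandas as pd

a = pd.Series({1: 2, 3: 4})
b = pd.Series({3: 5, 4: 6})

# Union of both indexes: the labels that a + b is aligned on.
labels = a.index.union(b.index)

# Reindexing inserts NaN for labels a Series lacks. NaN is a float,
# so the integer data is upcast to float64 at this point.
a_aligned = a.reindex(labels)
b_aligned = b.reindex(labels)

print(a.dtype, a_aligned.dtype)  # integer dtype before, float64 after
print(a_aligned + b_aligned)     # same result as a + b
```

Once the NaNs are in place, the element-wise addition simply propagates them, which is why the labels 1 and 4 come out as NaN.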
Given your problem description, i.e. summarizing value counts across columns, you may want to add a fill_value to the addition, which the pd.Series.add method lets you do:
>>> a.add(b, fill_value=0)
1 2.0
3 9.0
4 6.0
dtype: float64
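For the value-counts case from the question, the same idea might be sketched as follows. The DataFrame and its column names are hypothetical, made up purely for illustration, and folding the per-column counts with functools.reduce is just one possible way to combine more than two Series.

```python
from functools import reduce

import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame({
    "x": ["a", "b", "a"],
    "y": ["b", "c", "c"],
})

# One value_counts() Series per column.
counts = [df[col].value_counts() for col in df.columns]

# Fold the Series together, treating a label missing on one side as 0.
total = reduce(lambda left, right: left.add(right, fill_value=0), counts)

# Every label now has a value, so casting back to integers is safe.
total = total.astype("int64")
print(total)
```

Because fill_value=0 is applied at every step, no label ends up as NaN, regardless of how many columns it appears in.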
Note that, unfortunately, a.add(b, fill_value=0) still does type promotion because of the NaNs introduced during alignment. If that is an issue, you can easily fix it:
>>> a.add(b, fill_value=0).astype("int64")
1 2
3 9
4 6
dtype: int64
Upvotes: 7