Steve

Reputation: 429

How to sum with missing values in Pandas?

I want to sum Pandas Series objects, but the results I get do not match what the documentation says.

In Pandas 0.19.2, the following code:

import pandas as pd

a = pd.Series({1: 2, 3: 4})
b = pd.Series({3: 5, 4: 6})
print(a + b)

gives me:

1    NaN
3    9.0
4    NaN
dtype: float64

However, the documentation says:

When summing data, NA (missing) values will be treated as zero

This seems to treat them as NaN rather than zeros. I was expecting the output:

1    2.0
3    9.0
4    6.0
dtype: float64

In my case the Series come from value_counts() over several columns, and I wanted to sum them. Instead I get NaN for every row that does not have a value in all columns, which is wrong: there should be an integer for every row.

Another mystery for me is why the result has dtype float:

a.dtype, b.dtype, (a+b).dtype

gives:

(dtype('int64'), dtype('int64'), dtype('float64'))

which is quite surprising to me.

Edit: if I make sure that a and b have the same rows, then the resulting dtype is int64. So the change to float is clearly just to allow for the NaN value, which is a bit shocking.

Edit 2: Fix mistake in the expected output.

Upvotes: 3

Views: 15103

Answers (1)

juanpa.arrivillaga

Reputation: 95948

The claim in the documentation refers to reducing sums, i.e.:

>>> a + b
1    NaN
3    9.0
4    NaN
dtype: float64
>>> (a + b).sum()
9.0 # nans treated as zero...

It does not apply to vectorized (element-wise) sums. There you have to fill the missing values explicitly:

>>> (a + b).fillna(0)
1    0.0
3    9.0
4    0.0
dtype: float64
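To make the difference between the reducing sum and the element-wise sum concrete, here is a minimal sketch that reuses a and b from the question. It relies on the skipna parameter of Series.sum, which defaults to True; the variable names are only for illustration.

```python
import pandas as pd

a = pd.Series({1: 2, 3: 4})
b = pd.Series({3: 5, 4: 6})

# Element-wise sum: the indexes are aligned on their union, and labels
# present in only one operand become NaN.
elementwise = a + b

# Reducing sum: NaN values are skipped by default (skipna=True).
total_skipping_nan = elementwise.sum()

# With skipna=False, a single NaN makes the whole reduction NaN.
total_with_nan = elementwise.sum(skipna=False)

print(total_skipping_nan)
print(total_with_nan)
```

So the "treated as zero" behaviour belongs to the reduction step only, not to the + operator.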

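As for the promotion to float, that is a common pandas gotcha, which you can read about here. NaN is a floating-point value and the int64 dtype cannot represent a missing value, so the result is upcast to float64.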
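The promotion can be reproduced without any addition at all. The sketch below aligns a to the union of both indexes with reindex, once without and once with fill_value; it assumes a and b as defined in the question, and the names union, a_aligned, a_filled and b_filled are hypothetical.

```python
import pandas as pd

a = pd.Series({1: 2, 3: 4})
b = pd.Series({3: 5, 4: 6})

# Union of the two indexes: labels 1, 3 and 4.
union = a.index.union(b.index)

# Aligning without a fill value introduces NaN at label 4,
# so the integer Series is upcast to float64.
a_aligned = a.reindex(union)
print(a.dtype, a_aligned.dtype)

# Supplying an integer fill value means no NaN is ever created,
# so both operands and their sum keep an integer dtype.
a_filled = a.reindex(union, fill_value=0)
b_filled = b.reindex(union, fill_value=0)
result = a_filled + b_filled
print(result)
```

Filling before adding is one way to avoid the round trip through float when integer counts are needed.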

Given your problem description, i.e. summarizing value counts across columns, you may want to add a fill_value to the addition, which the pd.Series.add method lets you do:

>>> a.add(b, fill_value=0)
1    2.0
3    9.0
4    6.0
dtype: float64

Note that, unfortunately, the result is still promoted to float, because missing labels are handled as NaN internally. If that is an issue, you can cast back to an integer dtype:

>>> a.add(b, fill_value=0).astype(int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent
1    2
3    9
4    6
dtype: int64
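For the use case in the question (combining value_counts() results from several columns), the same idea extends to any number of Series. The following is a sketch under assumptions: the DataFrame df and its columns x, y and z are made-up data, and folding with functools.reduce is just one possible approach.

```python
from functools import reduce

import pandas as pd

# Hypothetical data standing in for the asker's columns.
df = pd.DataFrame({
    "x": ["a", "b", "a"],
    "y": ["b", "c", "c"],
    "z": ["a", "a", "d"],
})

# One value_counts() Series per column.
counts = [df[col].value_counts() for col in df.columns]

# Fold the Series together pairwise, treating labels missing on one side as 0.
total = reduce(lambda left, right: left.add(right, fill_value=0), counts)
total = total.astype(int).sort_index()

# Alternative: put the counts side by side and use the reducing sum,
# which skips NaN by default.
total_alt = (
    pd.concat(counts, axis=1, keys=list(df.columns))
    .sum(axis=1)
    .astype(int)
    .sort_index()
)

print(total)
```

The second variant is where the documentation's statement applies directly: the row-wise sum(axis=1) is a reducing sum, so missing counts do not propagate as NaN.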

Upvotes: 7
