Reputation: 13800
I have trouble understanding how pandas and/or numpy handle NaN values. I am extracting subsets of a pandas dataframe in order to compute t-statistics; e.g., I want to know whether the mean of x2 differs significantly between the group whose x1 value is A and the group whose x1 value is B. (Apologies for not making this a working example, but I don't know how to recreate the NaN values that pop up in my dataframe. The original data is read in using read_csv, with the csv denoting missing values with NA.) The setup looks like this:
import numpy as np
import pandas as pd
import scipy.stats as st
A = data[data['x1']=='A']['x2']
B = data[data['x1']=='B'].x2
A
2 3
3 1
5 2
6 3
10 3
12 2
15 2
16 0
21 0
24 1
25 1
28 NaN
31 0
32 3
...
677 0
681 NaN
682 3
683 1
686 2
Name: praxiserf, Length: 335, dtype: float64
That is, I have two pandas.core.series.Series objects, which I then want to perform a t-test on. However, using
st.ttest_ind(A, B)
returns:
(array(nan), nan)
I presume this has to do with the fact that ttest_ind accepts arrays as inputs, and there seems to be a problem with my NaN values when converting the series to an array. If I try to calculate means of the original series, I get:
A.mean(), B.mean()
1.5802, 1.2
However, when I try to turn the series into an array, I get:
A_array = np.asarray(A)
A_array
array([ 3., 1., 2., 3., 3., 2., 2., 0., 0., 1., 1.,
nan, 0., 3., ..., 1., nan, 0., 3. ])
That is, NaN turned into nan and taking means doesn't work anymore:
A_array.mean()
nan
How should the missing values be treated in order to ensure that I can still do calculations with the series/array?
Upvotes: 4
Views: 2301
Reputation: 7145
ttest_ind takes a parameter called "nan_policy" that dictates how nans are treated. By default nan_policy is "propagate" which results in nan if any values in the input are nan. "raise" will raise an error if any inputs are nan. "omit" ignores nan.
st.ttest_ind(A, B, nan_policy="omit")
should give you a non-nan result.
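To make the three behaviours concrete, here is a minimal sketch. The CSV text and its values are made up for illustration (they are not the asker's data); it relies on read_csv treating the string NA as missing by default, and on a SciPy version recent enough to support nan_policy in ttest_ind.

```python
import io

import numpy as np
import pandas as pd
import scipy.stats as st

# Hypothetical data: read_csv parses the string "NA" as NaN by default.
csv_text = """x1,x2
A,3
A,1
A,NA
A,2
B,0
B,NA
B,1
B,2
B,1
"""
data = pd.read_csv(io.StringIO(csv_text))

# Select x2 for each group and convert to plain NumPy arrays.
A = data.loc[data["x1"] == "A", "x2"].to_numpy()
B = data.loc[data["x1"] == "B", "x2"].to_numpy()

# Default nan_policy="propagate": any NaN in the input gives a NaN result.
propagated = st.ttest_ind(A, B)

# nan_policy="omit": NaN values are dropped from each sample before testing.
omitted = st.ttest_ind(A, B, nan_policy="omit")

# The same thing done by hand, filtering each sample separately.
dropped = st.ttest_ind(A[~np.isnan(A)], B[~np.isnan(B)])

print(propagated.statistic, omitted.statistic, dropped.statistic)
```

If this works as expected, the first statistic is nan and the other two agree, which suggests that "omit" is equivalent to dropping the missing values from each group yourself.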
Upvotes: 1
Reputation: 3881
pandas uses the same code as the bottleneck nanmean function, I believe, and so automatically ignores nans when taking the mean. numpy doesn't do that for you. What you really want to do, however, is mask out the nan values in each series separately (the two groups have different lengths and indices, so a single combined mask won't work) and pass the filtered series to the t-test:
mask_A = np.isfinite(A)
mask_B = np.isfinite(B)
st.ttest_ind(A[mask_A], B[mask_B])
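The difference between the pandas mean and the numpy mean seen in the question can be reproduced with a short sketch. The values below are made up; the point is only that Series.mean skips NaN by default (skipna=True), while ndarray.mean propagates it unless you use np.nanmean.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry.
s = pd.Series([3.0, 1.0, np.nan, 2.0])
arr = s.to_numpy()

pandas_mean = s.mean()                     # skipna=True by default, NaN ignored
strict_pandas_mean = s.mean(skipna=False)  # opt in to NaN propagation
numpy_mean = arr.mean()                    # NaN propagates through the sum
nan_aware_mean = np.nanmean(arr)           # NumPy's NaN-ignoring mean

print(pandas_mean, strict_pandas_mean, numpy_mean, nan_aware_mean)
```

So the NaN itself does not change when the series is converted to an array; only the default treatment of missing values differs between the two libraries.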
Upvotes: 5