Reputation: 527
I have a dataframe as below.
I want the p-value of a Mann-Whitney U test for each pair of columns. As an example, I tried the following.
from scipy.stats import mannwhitneyu
mannwhitneyu(df['A'], df['B'])
This results in the following values.
MannwhitneyuResult(statistic=3.5, pvalue=1.8224273379076809e-05)
I wondered whether NaN values affected the result, so I made the df2 and df3 dataframes shown in the figure and tried the following.
mannwhitneyu(df2, df3)
This resulted in
MannwhitneyuResult(statistic=3.5, pvalue=0.00025322465545184154)
So I think the NaN values affected the result.
Does anyone know how to ignore NaN values in the dataframe?
Upvotes: 2
Views: 3299
Reputation: 1149
As you can see, there is no argument in the mannwhitneyu function that lets you specify its behavior when it encounters NaN values. If you inspect its source code, you can see that it doesn't account for NaN values when calculating some of the key quantities (n1, n2, ranked, etc.). This makes me suspicious of any results you'd get when some of the input values are missing. If you don't feel like implementing the function yourself with NaN-ignoring capabilities, probably the best thing to do is either to create new arrays without missing values, as you've done, or to use df['A'].dropna() as suggested in the other answer.
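A minimal sketch of the dropna() approach, in case it helps. The dataframe here is made up for illustration (the original data is not shown in the post), and mwu_ignoring_nan is a hypothetical helper name, not part of SciPy. Each column is cleaned independently, so the two samples may end up with different lengths, which is fine for this test. Newer SciPy releases have added a nan_policy parameter to many scipy.stats functions, so it is worth checking the documentation for your installed version; the sketch does not rely on it.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical data standing in for the dataframe in the question.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, np.nan, 5.0, np.nan],
    "B": [6.0, 7.0, np.nan, 9.0, 10.0, 11.0],
})


def mwu_ignoring_nan(x, y):
    """Run mannwhitneyu after dropping NaN from each sample separately."""
    x = pd.Series(x).dropna()
    y = pd.Series(y).dropna()
    return mannwhitneyu(x, y, alternative="two-sided")


# Single comparison, as in the question.
result = mwu_ignoring_nan(df["A"], df["B"])
print(result.statistic, result.pvalue)

# The same helper applied to every pair of columns.
pvalues = {
    (a, b): mwu_ignoring_nan(df[a], df[b]).pvalue
    for a, b in combinations(df.columns, 2)
}
print(pvalues)
```

Dropping NaN per column rather than with df.dropna() keeps rows where only one of the two columns is missing, which matters when the samples are unpaired.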
Upvotes: 0