Tom_Hanks

Reputation: 527

How to ignore NaN in a dataframe for the Mann-Whitney U test?

I have a dataframe as below.

[Image: the dataframe df]

I want the p-value of the Mann-Whitney U test for each pair of columns. As an example, I tried the following.

from scipy.stats import mannwhitneyu
mannwhitneyu(df['A'], df['B'])

This results in the following values.

MannwhitneyuResult(statistic=3.5, pvalue=1.8224273379076809e-05)

I wondered whether the NaN values affected the result, so I made the dataframes df2 and df3 shown in the figure below, which contain no NaN values, and tried the following.

mannwhitneyu(df2, df3)

This resulted in

MannwhitneyuResult(statistic=3.5, pvalue=0.00025322465545184154)

So I think NaN values affected the result. Does anyone know how to ignore NaN values in the dataframe?

[Image: the dataframes df2 and df3]

Upvotes: 2

Views: 3299

Answers (2)

jsaporta

Reputation: 1149

As you can see, there is no argument in the mannwhitneyu function that lets you specify its behavior when it encounters NaN values. If you inspect its source code, you can see that it doesn't account for NaN values when calculating key quantities such as n1, n2, and ranked. This makes me suspicious of any result you get when some of the input values are missing. Unless you want to implement the function yourself with NaN-ignoring behavior, the best option is probably either to create new arrays without missing values, as you've done, or to use df['A'].dropna(), as suggested in the other answer.
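For illustration, here is a minimal sketch of what a NaN-ignoring implementation could look like, using only the standard library. The function name mann_whitney_u_ignore_nan is hypothetical, and the sketch assumes a two-sided test with the normal approximation, tie correction, and continuity correction. It is not SciPy's code, and its p-value may differ from SciPy's exact method on small samples.

```python
import math


def mann_whitney_u_ignore_nan(x, y):
    """Two-sided Mann-Whitney U test (normal approximation) that skips NaN.

    Hypothetical helper for illustration; not part of SciPy.
    """
    # Drop NaN first so that n1, n2 and the ranks reflect observed values only.
    x = [float(v) for v in x if not math.isnan(v)]
    y = [float(v) for v in y if not math.isnan(v)]
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("each sample needs at least one non-NaN value")

    # Rank the pooled sample, giving tied values their average rank.
    pooled = sorted((v, i) for i, v in enumerate(x + y))
    ranks = [0.0] * (n1 + n2)
    tie_term = 0.0
    start = 0
    while start < len(pooled):
        end = start
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = avg_rank
        t = end - start + 1
        tie_term += t**3 - t
        start = end + 1

    # U statistic of the first sample, from its rank sum.
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1

    # Normal approximation with tie and continuity corrections.
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        raise ValueError("all values are identical")
    z = (max(u1, u2) - mean_u - 0.5) / math.sqrt(var_u)
    p_value = min(math.erfc(z / math.sqrt(2)), 1.0)  # two-sided p-value
    return u1, p_value
```

Because the NaN values are removed before anything is counted or ranked, calling this with columns that contain NaN gives the same result as calling it with the cleaned values. Note that it returns the U statistic of the first sample, whereas some SciPy versions report the smaller of the two U values.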

Upvotes: 0

Yuca

Reputation: 6091

You can use dropna(). It is covered extensively in the pandas documentation for dropna.

As per your example, the syntax would go something like this:

mannwhitneyu(df['A'].dropna(), df['B'].dropna())
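Since the question asks for a comparison of each column, the same idea can be applied to every pair of columns. The following is a rough sketch: the dataframe contents are made up to stand in for the one in the question, the p_values dictionary is just an illustrative way to collect results, and alternative="two-sided" is an assumption about which test is wanted.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical data standing in for the dataframe in the question.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, np.nan, np.nan],
    "B": [4.0, 5.0, 6.0, 7.0, np.nan],
    "C": [2.5, 3.5, np.nan, 8.0, 9.0],
})

p_values = {}
for left, right in combinations(df.columns, 2):
    # Drop NaN per column, so a missing value in one column
    # does not discard valid observations from the other.
    x = df[left].dropna()
    y = df[right].dropna()
    result = mannwhitneyu(x, y, alternative="two-sided")
    p_values[(left, right)] = result.pvalue

for pair, p in p_values.items():
    print(pair, p)
```

Calling dropna() on each column separately matters here: df.dropna() on the whole dataframe would remove every row that has a NaN in any column, which throws away more data than necessary. Recent SciPy releases also document a nan_policy argument for mannwhitneyu; if your installed version has it, nan_policy="omit" should have the same effect, but check the documentation for your version.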

Upvotes: 3
