Hocine Ben

Reputation: 2279

How to use the Mann-Whitney U test in machine learning

I have a table (X, Y) where X is a matrix and Y is a vector of classes. Here is an example:

X = 0 0 1 0 1   and Y = 1
    0 1 0 0 0           1
    1 1 1 0 1           0
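
In NumPy form (which my code below assumes), the example looks like this:

import numpy as np

# the example above as NumPy arrays
X = np.array([[0, 0, 1, 0, 1],
              [0, 1, 0, 0, 0],
              [1, 1, 1, 0, 1]])
Y = np.array([1, 1, 0])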

I want to use the Mann-Whitney U test to compute feature importance (feature selection):

from scipy.stats import mannwhitneyu

# run the test once per feature column against the class vector Y
results = np.zeros((X.shape[1], 2))
for i in range(X.shape[1]):
    u, prob = mannwhitneyu(X[:, i], Y)
    results[i, :] = u, prob

I'm not sure whether this is correct or not. For a large table I obtain large values, e.g. u = 990 for some columns.

Upvotes: 4

Views: 17210

Answers (1)

Akavall

Reputation: 86276

I don't think the Mann-Whitney U test is a good way to do feature selection. Mann-Whitney tests whether the distributions of two variables are the same; it tells you nothing about how correlated the variables are. For example:

>>> import numpy as np
>>> from scipy.stats import mannwhitneyu
>>> a = np.arange(100)
>>> b = np.arange(100)
>>> np.random.shuffle(b)
>>> np.corrcoef(a,b)
   array([[ 1.        , -0.07155116],
          [-0.07155116,  1.        ]])
>>> mannwhitneyu(a, b)
(5000.0, 0.49951259627554112) # result for the (almost) uncorrelated pair
>>> mannwhitneyu(a, a)
(5000.0, 0.49951259627554112) # same result for the perfectly correlated pair (a against itself)

Because a and b have the same distribution, we fail to reject the null hypothesis that the distributions are identical, and we get exactly the same result whether the samples are perfectly correlated (a against itself) or essentially uncorrelated (a against the shuffled b).
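
The flip side holds too. As a quick sketch continuing the same session: a sample that is perfectly correlated with a but shifted has a different distribution, so the test rejects the null even though the relationship could not be stronger (I omit the printed values; only the direction of the result matters here):

>>> c = a + 50                 # perfectly correlated with a, but a shifted distribution
>>> np.corrcoef(a, c)[0, 1]    # perfect correlation (coefficient 1)
>>> mannwhitneyu(a, c)         # very small p-value: the test rejects "same distribution"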

And since in feature selection you are trying to find the features that best explain Y, the Mann-Whitney U test does not help you with that.

Upvotes: 13
