andreSmol

Reputation: 1038

Strange chi-square result using scikit-learn with a feature matrix

I am using scikit-learn to calculate the basic chi-square statistics (sklearn.feature_selection.chi2(X, y)):

from sklearn.feature_selection import chi2

def chi_square(feat, target):
    """Return the chi-square statistic and p-value for each feature."""
    ch, pval = chi2(feat, target)
    return ch, pval



chisq,p = chi_square(feat_mat,target_sc)
print(chisq)
print("**********************")
print(p)

I have 1500 samples, 45 features, and 4 classes. The input is a 1500x45 feature matrix and a target array with 1500 components. The feature matrix is not sparse. When I run the program and print the array "chisq" with its 45 components, I can see that component 13 has a negative value and p = 1. How is that possible? What does it mean, or what big mistake am I making?

I am attaching the printouts of chisq and p:

[  9.17099260e-01   3.77439701e+00   5.35004211e+01   2.17843312e+03
   4.27047184e+04   2.23204883e+01   6.49985540e-01   2.02132664e-01
   1.57324454e-03   2.16322638e-01   1.85592258e+00   5.70455805e+00
   1.34911126e-02  -1.71834753e+01   1.05112366e+00   3.07383691e-01
   5.55694752e-02   7.52801686e-01   9.74807972e-01   9.30619466e-02
   4.52669897e-02   1.08348058e-01   9.88146259e-03   2.26292358e-01
   5.08579194e-02   4.46232554e-02   1.22740419e-02   6.84545170e-02
   6.71339545e-03   1.33252061e-02   1.69296016e-02   3.81318236e-02
   4.74945604e-02   1.59313146e-01   9.73037448e-03   9.95771327e-03
   6.93777954e-02   3.87738690e-02   1.53693158e-01   9.24603716e-04
   1.22473138e-01   2.73347277e-01   1.69060817e-02   1.10868365e-02
   8.62029628e+00]

**********************

[  8.21299526e-01   2.86878266e-01   1.43400668e-11   0.00000000e+00
   0.00000000e+00   5.59436980e-05   8.84899894e-01   9.77244281e-01
   9.99983411e-01   9.74912223e-01   6.02841813e-01   1.26903019e-01
   9.99584918e-01   1.00000000e+00   7.88884155e-01   9.58633878e-01
   9.96573548e-01   8.60719653e-01   8.07347364e-01   9.92656816e-01
   9.97473024e-01   9.90817144e-01   9.99739526e-01   9.73237195e-01
   9.96995722e-01   9.97526259e-01   9.99639669e-01   9.95333185e-01
   9.99853998e-01   9.99592531e-01   9.99417113e-01   9.98042114e-01
   9.97286030e-01   9.83873717e-01   9.99745466e-01   9.99736512e-01
   9.95239765e-01   9.97992843e-01   9.84693908e-01   9.99992525e-01
   9.89010468e-01   9.64960636e-01   9.99418323e-01   9.99690553e-01
   3.47893682e-02]

Upvotes: 2

Views: 3934

Answers (1)

unutbu

Reputation: 879471

If you put some print statements in the code defining chi2,

def chi2(X, y):
    X = atleast2d_or_csr(X)
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)
    observed = safe_sparse_dot(Y.T, X)          # n_classes * n_features
    print(repr(observed))
    feature_count = array2d(X.sum(axis=0))
    class_prob = array2d(Y.mean(axis=0))
    expected = safe_sparse_dot(class_prob.T, feature_count)
    print(repr(expected))
    return stats.chisquare(observed, expected)

you'll see that expected ends up having some negative values. For example,

import numpy as np
import sklearn.feature_selection as FS

x = np.array([-0.23918515, -0.29967287, -0.33007592, 0.07383528, -0.09205183,
              -0.12548226, 0.04770942, -0.54318463, -0.16833203, -0.00332341,
              0.0179646, -0.0526383, 0.04288736, -0.27427317, -0.16136621,
              -0.09228812, -0.2255725, -0.03744027, 0.02953499, -0.17387492])

y = np.array([1, 2, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 1, 1, 2, 1, 2, 1, 1],
             dtype = 'int64')

FS.chi2(x.reshape(-1,1),y)

yields

observed:
array([[-1.31238179],
       [-0.76922812],
       [-0.52522003]])

expected:
array([[-1.56409796],
       [-0.78204898],
       [-0.26068299]])

stats.chisquare(observed, expected) is then called. There, observed and expected are assumed to be frequencies of categories, so they should all be non-negative numbers.
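
To make the arithmetic concrete, here is a minimal pure-Python sketch that rebuilds observed and expected for the sample above and then applies the Pearson formula, the sum of (O - E)^2 / E, by hand. It only illustrates the computation and is not the scikit-learn source; the names terms and statistic are my own.

```python
# Sample from above: one feature column x and the class labels y.
x = [-0.23918515, -0.29967287, -0.33007592, 0.07383528, -0.09205183,
     -0.12548226, 0.04770942, -0.54318463, -0.16833203, -0.00332341,
     0.0179646, -0.0526383, 0.04288736, -0.27427317, -0.16136621,
     -0.09228812, -0.2255725, -0.03744027, 0.02953499, -0.17387492]
y = [1, 2, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 1, 1, 2, 1, 2, 1, 1]

classes = sorted(set(y))
n = len(y)

# Observed: sum of the feature values within each class (Y.T dot X).
observed = [sum(xi for xi, yi in zip(x, y) if yi == c) for c in classes]

# Expected: class probability times the overall feature sum.
total = sum(x)
expected = [y.count(c) / n * total for c in classes]

# Each numerator (O - E)^2 is non-negative, so a negative E makes
# its term negative, and the sum can drop below zero.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = sum(terms)

print(observed)
print(expected)
print(statistic)
```

Since every expected value here is negative, every term is negative, and so is the resulting statistic. A true chi-square statistic can never be negative, so the corresponding p-value is meaningless.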

I'm not familiar enough with scikit-learn to suggest how your problem should be fixed, but it appears that the data you are passing to chi2 is of the wrong kind, since expected should be non-negative.

(e.g. Could it be that the x values above should all be positive and represent frequencies of observations?)
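
As a quick way to check this on your own data, the sketch below lists which columns of a feature matrix contain negative entries. The helper name negative_columns and the small feat matrix are hypothetical and only for illustration. It uses plain Python lists, but since it only iterates over rows and indexes into them, it should also accept a 2-D NumPy array.

```python
def negative_columns(rows):
    """Return the indices of columns holding at least one negative value."""
    n_cols = len(rows[0])
    return [j for j in range(n_cols) if any(row[j] < 0 for row in rows)]


# Hypothetical 3x3 feature matrix; only column 1 has a negative entry.
feat = [
    [0.0, 1.5, 2.0],
    [1.0, -0.2, 0.0],
    [3.0, 0.7, 4.0],
]

bad = negative_columns(feat)
print(bad)
```

Any column reported here breaks the assumption that the features are frequencies, so its chi-square value should not be trusted.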

Upvotes: 1
