Reputation: 1038
I am using scikit-learn to calculate basic chi-square statistics with sklearn.feature_selection.chi2(X, y):
def chi_square(feat, target):
    """Return the chi-square statistic and p-value for each feature."""
    from sklearn.feature_selection import chi2
    ch, pval = chi2(feat, target)
    return ch, pval

chisq, p = chi_square(feat_mat, target_sc)
print(chisq)
print("**********************")
print(p)
I have 1500 samples, 45 features, and 4 classes. The input is a 1500x45 feature matrix and a target array with 1500 components. The feature matrix is not sparse. When I run the program and print the array "chisq" with its 45 components, I can see that component 13 has a negative value and p = 1. How is this possible? What does it mean, or what mistake am I making?
I am attaching the printouts of chisq and p:
[ 9.17099260e-01 3.77439701e+00 5.35004211e+01 2.17843312e+03
4.27047184e+04 2.23204883e+01 6.49985540e-01 2.02132664e-01
1.57324454e-03 2.16322638e-01 1.85592258e+00 5.70455805e+00
1.34911126e-02 -1.71834753e+01 1.05112366e+00 3.07383691e-01
5.55694752e-02 7.52801686e-01 9.74807972e-01 9.30619466e-02
4.52669897e-02 1.08348058e-01 9.88146259e-03 2.26292358e-01
5.08579194e-02 4.46232554e-02 1.22740419e-02 6.84545170e-02
6.71339545e-03 1.33252061e-02 1.69296016e-02 3.81318236e-02
4.74945604e-02 1.59313146e-01 9.73037448e-03 9.95771327e-03
6.93777954e-02 3.87738690e-02 1.53693158e-01 9.24603716e-04
1.22473138e-01 2.73347277e-01 1.69060817e-02 1.10868365e-02
8.62029628e+00]
**********************
[ 8.21299526e-01 2.86878266e-01 1.43400668e-11 0.00000000e+00
0.00000000e+00 5.59436980e-05 8.84899894e-01 9.77244281e-01
9.99983411e-01 9.74912223e-01 6.02841813e-01 1.26903019e-01
9.99584918e-01 1.00000000e+00 7.88884155e-01 9.58633878e-01
9.96573548e-01 8.60719653e-01 8.07347364e-01 9.92656816e-01
9.97473024e-01 9.90817144e-01 9.99739526e-01 9.73237195e-01
9.96995722e-01 9.97526259e-01 9.99639669e-01 9.95333185e-01
9.99853998e-01 9.99592531e-01 9.99417113e-01 9.98042114e-01
9.97286030e-01 9.83873717e-01 9.99745466e-01 9.99736512e-01
9.95239765e-01 9.97992843e-01 9.84693908e-01 9.99992525e-01
9.89010468e-01 9.64960636e-01 9.99418323e-01 9.99690553e-01
3.47893682e-02]
Upvotes: 2
Views: 3934
Reputation: 879471
If you put some print statements in the code defining chi2,
def chi2(X, y):
    X = atleast2d_or_csr(X)
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)
    observed = safe_sparse_dot(Y.T, X)  # n_classes * n_features
    print(repr(observed))
    feature_count = array2d(X.sum(axis=0))
    class_prob = array2d(Y.mean(axis=0))
    expected = safe_sparse_dot(class_prob.T, feature_count)
    print(repr(expected))
    return stats.chisquare(observed, expected)
you'll see that expected ends up having some negative values.
import numpy as np
import sklearn.feature_selection as FS

x = np.array([-0.23918515, -0.29967287, -0.33007592, 0.07383528, -0.09205183,
              -0.12548226, 0.04770942, -0.54318463, -0.16833203, -0.00332341,
              0.0179646, -0.0526383, 0.04288736, -0.27427317, -0.16136621,
              -0.09228812, -0.2255725, -0.03744027, 0.02953499, -0.17387492])
y = np.array([1, 2, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 1, 1, 2, 1, 2, 1, 1],
             dtype='int64')
FS.chi2(x.reshape(-1, 1), y)
yields
observed:
array([[-1.31238179],
[-0.76922812],
[-0.52522003]])
expected:
array([[-1.56409796],
[-0.78204898],
[-0.26068299]])
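stats.chisquare(observed, expected) is then called. There, observed and expected are assumed to be frequencies of categories, so they should all be non-negative numbers. Each term of the statistic is (observed - expected)^2 / expected, so a negative expected value contributes a negative term, which is how the statistic itself can come out negative.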
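A minimal sketch of that computation, using only NumPy: it rebuilds observed and expected from the x and y in the example and then sums (observed - expected)**2 / expected by hand. The one-hot encoding via np.unique is an assumption standing in for LabelBinarizer; this is not scikit-learn's actual code.

```python
import numpy as np

x = np.array([-0.23918515, -0.29967287, -0.33007592, 0.07383528, -0.09205183,
              -0.12548226, 0.04770942, -0.54318463, -0.16833203, -0.00332341,
              0.0179646, -0.0526383, 0.04288736, -0.27427317, -0.16136621,
              -0.09228812, -0.2255725, -0.03744027, 0.02953499, -0.17387492])
y = np.array([1, 2, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 1, 1, 2, 1, 2, 1, 1])

X = x.reshape(-1, 1)                                   # n_samples x n_features
classes = np.unique(y)
# One-hot class indicators (assumed stand-in for LabelBinarizer).
Y = (y[:, None] == classes[None, :]).astype(float)     # n_samples x n_classes

observed = Y.T @ X                                     # per-class feature sums
class_prob = Y.mean(axis=0)[:, None]                   # n_classes x 1
feature_count = X.sum(axis=0)[None, :]                 # 1 x n_features
expected = class_prob * feature_count                  # n_classes x n_features

# Pearson chi-square terms; a negative expected value makes a term negative.
terms = (observed - expected) ** 2 / expected
stat = terms.sum(axis=0)

print(observed.ravel())
print(expected.ravel())
print(stat)
```

Because every entry of expected is negative here, every term is negative, and so is the resulting statistic.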
I'm not familiar enough with scikit-learn to suggest how your problem should be fixed, but it appears that the data you are passing to chi2 is of the wrong kind, since expected should be non-negative.
(For example, could it be that the x values above should all be positive and represent frequencies of observations?)
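If it helps to find which of the 45 features are affected, a quick check along these lines may be useful. Note that columns_with_negatives is a hypothetical helper, not part of scikit-learn, and the small demo matrix is made up for illustration.

```python
import numpy as np

def columns_with_negatives(X):
    """Return the indices of columns of X that contain a negative value."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)           # treat a 1-D array as a single feature
    return np.flatnonzero((X < 0).any(axis=0))

# Made-up 3 x 3 feature matrix; only the middle column has a negative entry.
demo = np.array([[1.0,  0.5, 2.0],
                 [0.0, -0.3, 1.0],
                 [3.0,  0.2, 0.0]])

bad = columns_with_negatives(demo)
print(bad)
```

Running the same check on the real feature matrix (feat_mat in the question) would show whether the negative statistic coincides with a feature that takes negative values.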
Upvotes: 1