Reputation: 5380
I have a matrix of size (61964, 25). Here is a sample:
array([[ 1., 0., 0., 4., 0., 1., 0., 0., 0., 0., 3.,
0., 2., 1., 0., 0., 3., 0., 3., 0., 14., 0.,
2., 0., 4.],
[ 0., 0., 0., 1., 2., 0., 0., 0., 0., 0., 1.,
0., 2., 0., 0., 0., 0., 0., 0., 0., 5., 0.,
0., 0., 1.]])
Scikit-learn offers a useful function for this, as long as the data are normally distributed:
from sklearn import preprocessing
X_2 = preprocessing.scale(X[:, :3])
My problem, however, is that I have to work on a row basis, and each row consists of only 25 observations, so the normal distribution is not applicable here. The solution is to use the t-distribution, but how can I do that in Python?
Normally, values go from 0 to, say, 20. When I see unusually high numbers, I filter out the whole row. The following histogram shows what my actual distribution looks like:
Upvotes: 4
Views: 5134
Reputation: 176810
scipy.stats has the function zscore, which allows you to calculate how many standard deviations a value is above the mean (often referred to as the standard score or Z score).
If arr is the example array from your question, then you can compute the Z score across each row of 25 as follows:
>>> import scipy.stats as stats
>>> stats.zscore(arr, axis=1)
array([[-0.18017365, -0.52666143, -0.52666143, 0.8592897 , -0.52666143,
-0.18017365, -0.52666143, -0.52666143, -0.52666143, -0.52666143,
0.51280192, -0.52666143, 0.16631414, -0.18017365, -0.52666143,
-0.52666143, 0.51280192, -0.52666143, 0.51280192, -0.52666143,
4.32416754, -0.52666143, 0.16631414, -0.52666143, 0.8592897 ],
[-0.43643578, -0.43643578, -0.43643578, 0.47280543, 1.38204664,
-0.43643578, -0.43643578, -0.43643578, -0.43643578, -0.43643578,
0.47280543, -0.43643578, 1.38204664, -0.43643578, -0.43643578,
-0.43643578, -0.43643578, -0.43643578, -0.43643578, -0.43643578,
4.10977027, -0.43643578, -0.43643578, -0.43643578, 0.47280543]])
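To make explicit what zscore does with axis=1, here is a minimal sketch that reproduces the result above with plain NumPy. It uses the two sample rows from the question; the names row_mean, row_std and manual_z are only illustrative.

```python
import numpy as np
import scipy.stats as stats

# The two sample rows from the question (25 observations each).
arr = np.array([
    [1., 0., 0., 4., 0., 1., 0., 0., 0., 0., 3., 0., 2.,
     1., 0., 0., 3., 0., 3., 0., 14., 0., 2., 0., 4.],
    [0., 0., 0., 1., 2., 0., 0., 0., 0., 0., 1., 0., 2.,
     0., 0., 0., 0., 0., 0., 0., 5., 0., 0., 0., 1.],
])

# Per-row mean and population standard deviation (NumPy's default ddof=0).
# keepdims=True keeps shape (n_rows, 1) so the values broadcast across each row.
row_mean = arr.mean(axis=1, keepdims=True)
row_std = arr.std(axis=1, keepdims=True)

manual_z = (arr - row_mean) / row_std

# Should agree with scipy's row-wise result.
print(np.allclose(manual_z, stats.zscore(arr, axis=1)))
```

Each row is centred and scaled using only its own 25 values, which is the difference from scaling column by column.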
This calculation uses the population mean and standard deviation for each row. To use the sample variance instead (as with the t-statistic), additionally specify ddof=1:
stats.zscore(arr, axis=1, ddof=1)
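Since the question mentions filtering out whole rows that contain unusually high numbers, the row-wise scores could be used for that roughly as sketched below. The helper name drop_outlier_rows and the threshold values are assumptions for illustration, not something scipy provides; choose a cutoff that suits your data.

```python
import numpy as np
import scipy.stats as stats

def drop_outlier_rows(data, threshold):
    """Keep only rows in which no value has |z| above threshold.

    Hypothetical helper: z-scores are computed per row with the
    sample standard deviation (ddof=1).
    """
    z = stats.zscore(data, axis=1, ddof=1)
    has_outlier = (np.abs(z) > threshold).any(axis=1)
    return data[~has_outlier]

# The two sample rows from the question.
arr = np.array([
    [1., 0., 0., 4., 0., 1., 0., 0., 0., 0., 3., 0., 2.,
     1., 0., 0., 3., 0., 3., 0., 14., 0., 2., 0., 4.],
    [0., 0., 0., 1., 2., 0., 0., 0., 0., 0., 1., 0., 2.,
     0., 0., 0., 0., 0., 0., 0., 5., 0., 0., 0., 1.],
])

# Both sample rows contain one value roughly 4 sample standard
# deviations above the row mean, so a cutoff of 3 removes both rows,
# while a cutoff of 5 keeps both.
strict = drop_outlier_rows(arr, threshold=3.0)
loose = drop_outlier_rows(arr, threshold=5.0)
print(strict.shape, loose.shape)
```

Note that a row whose values are all identical has zero standard deviation, so its scores come out as NaN; comparisons with NaN are False, which means such a row would be kept by this sketch.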
Upvotes: 5