Reputation: 473
I am doing text classification of documents. I have around 4k categories and 1.1 million data samples.
I am constructing a matrix that contains the frequency of words in each document. A sample of the matrix looks as below:
     X1  X2  X3  X4
D1    1   1   0   1
D2    1   1   1   0
D3    1   1   0   0
D4    1   1   1   1
D5    0   0   1   0
D6    0   0   1   1
In the above matrix, X1 and X2 are redundant features because they have the same values in all rows.
When I construct the matrix from the 1.1 million samples, I get a huge matrix with 90k features.
To reduce the matrix dimension, I am using PCA as a dimensionality reduction technique; since I am working with a sparse matrix, I have used TruncatedSVD to compute it.
I am using the scikit-learn implementation, with the code below:
from sklearn.decomposition import TruncatedSVD

X = [[1, 1, 0, 1], [1, 1, 1, 0], [1, 1, 0, 0],
     [1, 1, 1, 1], [0, 0, 1, 0], [0, 0, 1, 1]]

svd = TruncatedSVD(n_components=3)
# fit_transform both fits the SVD and projects X, so a separate fit() call is redundant
X_new = svd.fit_transform(X)
The output of X_new is:
array([[ 1.53489494, -0.49612748, -0.63083679],
[ 1.57928583, -0.04762643, 0.70963934],
[ 1.13759356, -0.80736818, 0.2324597 ],
[ 1.97658721, 0.26361427, -0.15365716],
[ 0.44169227, 0.75974175, 0.47717963],
[ 0.83899365, 1.07098246, -0.38611686]])
This is the reduced-dimension representation I get, and I am giving X_new as input to a Naive Bayes classifier:
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
model = clf.fit(X_new, Y)
For the 1.1 million samples I got the following results:
No. of components ("n_components" parameter)    accuracy
1000                                            6.57%
500                                             7.25%
100                                             5.72%
I am getting very low accuracy.
Are the above steps correct?
What else do I need to include?
Upvotes: 1
Views: 4761
Reputation: 11424
The accuracy is low because you lose most of the information during dimensionality reduction.
You can check this with sum(svd.explained_variance_ratio_). This number, like R^2, measures the precision of your model: it equals 1 if all information is preserved by the SVD, and 0 if no information is preserved. In your case (3 dimensions out of 90K features) I expect it to be of order 0.1%.
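A minimal sketch of that check (the random sparse matrix here is just a hypothetical stand-in for your real document-term matrix):

from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for the real 1.1M x 90K document-term matrix
X = sparse_random(1000, 5000, density=0.01, format='csr', random_state=0)

svd = TruncatedSVD(n_components=100, random_state=0)
svd.fit(X)

# Fraction of the original variance kept by the 100 components;
# a value near 0 means the reduction discarded almost all information
print(sum(svd.explained_variance_ratio_))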
For your problem, I would recommend one of two strategies.
1. Do not reduce dimensions mathematically. Instead, preprocess your text linguistically: drop the stop-words, stem or lemmatize the remaining words, and drop the words that occur fewer than k times. This will bring your dimensionality down from 90K to something like 15K without serious loss of information.
On these features you can train a sparse model (like SGDClassifier with a huge L1 penalty), which could bring the number of actually used features down to something like 1K while keeping good accuracy. It sometimes helps to transform your word counts with TF-IDF before feeding them to a linear classifier; a combined sketch of this strategy is given after the list.
2. Use a pre-trained dimensionality reducer, like word2vec or fastText, to extract features from your text. Pre-trained word2vec models are available on the Internet for multiple languages and in several dimensionalities (like 200, 1000, etc.); the second sketch below shows one way to apply them.
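A sketch of strategy 1, assuming scikit-learn; the toy corpus and the parameter values (min_df, alpha) are illustrative, and the stemming/lemmatization step is omitted since it needs an extra library such as NLTK or spaCy:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with your 1.1M documents and labels
docs = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "stocks fell sharply as markets slid",
    "markets rallied and stocks rose",
]
labels = ["pets", "pets", "finance", "finance"]

pipeline = make_pipeline(
    # Drop English stop-words, ignore words seen in fewer than 2 documents,
    # and weight the counts with TF-IDF; on the real corpus a higher min_df
    # is what shrinks the 90K raw features
    TfidfVectorizer(stop_words="english", min_df=2),
    # Linear classifier with an L1 penalty, which drives most feature
    # weights to exactly zero, i.e. a sparse model
    SGDClassifier(penalty="l1", alpha=1e-4, random_state=0),
)
pipeline.fit(docs, labels)
print(pipeline.predict(["the cat chased the mouse"]))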
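And a sketch of strategy 2, assuming the gensim library and its downloadable "glove-wiki-gigaword-100" vectors (any pre-trained word2vec or fastText model exposed as gensim KeyedVectors works the same way; the model name and the doc_vector helper are illustrative): each document becomes the average of its word vectors, and these dense vectors replace the SVD output as classifier input.

import numpy as np
import gensim.downloader

# Download a pre-trained 100-dimensional GloVe model; any
# word2vec/fastText KeyedVectors model can be substituted here
wv = gensim.downloader.load("glove-wiki-gigaword-100")

def doc_vector(text):
    # Average the vectors of in-vocabulary words: a simple way to turn
    # a document into a fixed-size dense feature vector
    words = [w for w in text.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

X_new = np.vstack([doc_vector(d) for d in ["the cat sat on the mat"]])
print(X_new.shape)  # (1, 100)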
Upvotes: 2