Reputation: 3029
For supervised learning, my feature matrix is so large that only a few models will run on it. I have read that PCA can reduce the dimensionality substantially.
Below is my code:
import subprocess
import numpy as np
def run(command):
    # Run a shell command and return its raw output
    output = subprocess.check_output(command, shell=True)
    return output
f = open('/Users/ya/Documents/10percent/Vik.txt','r')
vocab_temp = f.read().split()
f.close()
col = len(vocab_temp)
print("Training column size:")
print(col)
#dataset = list()
row = run('cat '+'/Users/ya/Documents/10percent/X_true.txt'+" | wc -l").split()[0]
print("Training row size:")
print(row)
matrix_tmp = np.zeros((int(row),col), dtype=np.int64)
print("Train Matrix size:")
print(matrix_tmp.size)
# label_tmp.ndim must be equal to 1
label_tmp = np.zeros((int(row)), dtype=np.int64)
f = open('/Users/ya/Documents/10percent/X_true.txt','r')
count = 0
for line in f:
    line_tmp = line.split()
    #print(line_tmp)
    for word in line_tmp[0:]:
        if word not in vocab_temp:
            continue
        matrix_tmp[count][vocab_temp.index(word)] = 1
    count = count + 1
f.close()
print("Train matrix is:\n ")
print(matrix_tmp)
print(label_tmp)
print(len(label_tmp))
print("No. of topics in train:")
print(len(set(label_tmp)))
print("Train Label size:")
print(len(label_tmp))
I wish to apply PCA to matrix_tmp as it has a size of about (202180x9984). How can I modify my code to include it?
Upvotes: 1
Views: 1230
Reputation: 3550
import codecs
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
with codecs.open('input_file', 'r', encoding='utf-8') as inf:
    lines = inf.readlines()
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(lines)
perform_pca = False
if perform_pca:
    n_components = 100
    pca = TruncatedSVD(n_components)
    X_train = pca.fit_transform(X_train)
1- Do the vectorization with one of the vectorizers available in sklearn; they produce sparse matrices instead of a dense matrix that is mostly zeros.
2- Do the PCA only if needed.
3- For performance, tune the parameters of your vectorizer and PCA if needed.
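If you want the columns to match your existing word list, the points above could be sketched roughly as follows. This is only an illustration: the `docs` and `vocab` lists are made-up stand-ins for the lines of X_true.txt and the words in Vik.txt, and `n_components=2` is chosen just to fit the toy data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder data (assumed): stand-ins for X_true.txt lines and Vik.txt words
docs = [
    "apple banana apple",
    "banana cherry",
    "cherry durian apple",
    "durian unknownword",
]
vocab = ["apple", "banana", "cherry", "durian"]

# binary=True records presence (0/1) rather than counts, like the original loop;
# vocabulary=vocab fixes the column order and ignores out-of-vocabulary words
vectorizer = CountVectorizer(binary=True, vocabulary=vocab)
X_sparse = vectorizer.fit_transform(docs)  # scipy sparse matrix, zeros are not stored

print(X_sparse.shape)  # (number of documents, number of vocabulary words)
print(X_sparse.nnz)    # number of stored non-zero entries

# Reduce to 2 components; random_state only makes the toy run repeatable
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)
print(X_reduced.shape)
```

Note that CountVectorizer's default tokenizer lowercases the text and drops one-character tokens, so the result may differ slightly from a plain split() on whitespace.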
Upvotes: 1
Reputation: 8270
Scikit-learn provides several PCA implementations. One useful one is TruncatedSVD, which does not center the data and therefore works directly on sparse matrices. Its usage is fairly straightforward:
from sklearn.decomposition import TruncatedSVD
n_components=100
pca = TruncatedSVD(n_components)
matrix_reduced = pca.fit_transform(matrix_tmp)
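Because matrix_tmp in the question is a dense 0/1 array, it may help to convert it to a sparse matrix before reducing it, and then check how much variance the reduced version keeps. A rough sketch, assuming a small random 0/1 matrix as a stand-in for matrix_tmp (the shape, density, and n_components below are illustrative only):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Stand-in for matrix_tmp (assumed): a small, mostly-zero 0/1 matrix
rng = np.random.default_rng(0)
matrix_tmp = (rng.random((200, 50)) < 0.05).astype(np.int64)

# CSR storage keeps only the non-zero entries; cast to float for the SVD
matrix_sparse = sparse.csr_matrix(matrix_tmp, dtype=np.float64)

n_components = 10
svd = TruncatedSVD(n_components, random_state=0)
matrix_reduced = svd.fit_transform(matrix_sparse)

print(matrix_reduced.shape)  # (200, 10)
# Fraction of the total variance retained by the kept components
print(svd.explained_variance_ratio_.sum())
```

For scale, a dense int64 array of shape (202180, 9984) holds about 2.02 billion entries, or roughly 16 GB, so building the matrix in sparse form from the start is likely to matter more than the reduction step itself.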
Upvotes: 0