Reputation: 4164
I am using scikit-learn to perform some analytics on a large dataset (roughly 34,000 files). Now I was wondering: the HashingVectorizer aims at low memory usage. Is it possible to first convert a bunch of files to HashingVectorizer feature matrices (saved with pickle.dump) and then load all these files together and convert them to TfIdf features? These features can be calculated from the HashingVectorizer output, because counts are stored and the number of documents can be deduced. I now have the following:
import pickle
from sklearn.feature_extraction.text import HashingVectorizer

for text in texts:
    vectorizer = HashingVectorizer(norm=None, non_negative=True)
    features = vectorizer.fit_transform([text])
    with open(path, 'wb') as handle:
        pickle.dump(features, handle)
Then, loading the files is trivial:
from sklearn.feature_extraction.text import TfidfVectorizer

data = []
for path in paths:
    with open(path, 'rb') as handle:
        data.append(pickle.load(handle))

tfidf = TfidfVectorizer()
tfidf.fit_transform(data)
But the magic does not happen. How can I make the magic happen?
Upvotes: 3
Views: 4128
Reputation: 8270
It seems the problem is that you are vectorizing your text twice. Once you have built a matrix of counts, you should be able to transform the counts to tf-idf features using sklearn.feature_extraction.text.TfidfTransformer instead of TfidfVectorizer.
Also, it appears your saved data is a sparse matrix. You should be stacking the loaded matrices using scipy.sparse.vstack() instead of passing a list of matrices to TfidfTransformer.
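Something along these lines ought to work (an untested sketch, reusing the paths list from your question):
import pickle
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfTransformer

# Load each pickled count matrix (one per file)...
matrices = []
for path in paths:
    with open(path, 'rb') as handle:
        matrices.append(pickle.load(handle))

# ...stack them into a single sparse count matrix...
counts = vstack(matrices)

# ...and convert the counts to tf-idf features.
tfidf = TfidfTransformer()
features = tfidf.fit_transform(counts)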
Upvotes: 5
Reputation: 36545
I'm quite worried by your loop
for text in texts:
    vectorizer = HashingVectorizer(norm=None, non_negative=True)
    features = vectorizer.fit_transform([text])
Each time you re-fit your vectoriser it may forget its vocabulary, so the entries in each vector won't correspond to the same words (I'm not sure about this; I guess it depends on how they do the hashing). Why not just fit it on the whole corpus, i.e.
features = vectorizer.fit_transform(texts)
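(If you want to check whether the hashing is consistent across instances, a quick sanity test along these lines should tell you, assuming the parameters from your snippet:)
from sklearn.feature_extraction.text import HashingVectorizer

# Two independently created vectorisers; if the hashing is stateless,
# the same text should land in exactly the same columns.
a = HashingVectorizer(norm=None, non_negative=True).fit_transform(["some example text"])
b = HashingVectorizer(norm=None, non_negative=True).fit_transform(["some example text"])
print((a - b).nnz)  # 0 means the two vectors are identical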
For your actual question, it sounds like you are just trying to normalise the columns of your data matrix by the IDF; you should be able to do this directly on the arrays (I've converted to numpy arrays since I can't work out how the indexing works on the scipy arrays). The mask DF != 0 is necessary since you used the hashing vectoriser, which has 2^20 columns:
import numpy as np

# Densify the hashed count matrix
X = np.array(features.todense())
# Document frequency: number of documents in which each column is non-zero
DF = (X != 0).sum(axis=0)
# Keep only columns that appear somewhere, and divide each by its DF
X_TFIDF = X[:, DF != 0] / DF[DF != 0]
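(If the dense conversion is too heavy for 34,000 documents, the same scaling can probably be done while staying sparse; a rough sketch, assuming your scipy version supports boolean column indexing on CSR matrices:)
import numpy as np
from scipy.sparse import diags

# Document frequency per column, computed without densifying
DF = np.asarray((features != 0).sum(axis=0)).ravel()
keep = DF != 0

# Scale each remaining column by 1/DF using a sparse diagonal matrix
X_TFIDF = features[:, keep].dot(diags(1.0 / DF[keep]))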
Upvotes: -1