Reputation: 1920
I have roughly 100,000 long articles, about 5 GB of text in total. When I run TfidfVectorizer from sklearn on them, it constructs a model of about 6 GB. How is that possible? Don't we only need to store the document frequency of those 4000 words and what those 4000 words are? My guess is that TfidfVectorizer stores such a 4000-dimensional vector for every document. Is it possible that I have some setting configured incorrectly?
Upvotes: 1
Views: 2773
Reputation: 626
I know there is an accepted answer already, but here is some additional information for others to consider. When you pickle the TfidfVectorizer directly, you also save its stop_words_ attribute, which is not needed once the vocabulary has been established. In one of our models there were only 3000 words in the vocabulary, yet the saved model occupied 250 MB; inspecting it, we found that 10 million stop words were stored along with the model. Then we noticed the following warning in the TfidfVectorizer documentation:
"The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling."
Applying that reduced our model size significantly.
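Concretely, a minimal sketch of that clean-up (assuming your fitted vectorizer lives in a variable named vectorizer):

import pickle

# Drop the introspection-only stop_words_ attribute before serializing;
# the fitted vocabulary_ and idf_ are what transform() actually needs.
vectorizer.stop_words_ = None  # or: delattr(vectorizer, "stop_words_")
with open("tfidf.pkl", "wb") as f:
    pickle.dump(vectorizer, f)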
Upvotes: 2
Reputation: 464
A TF-IDF matrix has shape (number_of_documents, number_of_unique_words), so for each document you get a feature for every word in the dataset. This can get bloated for large datasets.
In your case, a dense matrix would take roughly
100000 (docs) * 4000 (words) * 8 (np.float64 bytes) / 1024**3 ~ 3 GB
Moreover, sklearn's TfidfVectorizer compensates for this by default by returning a sparse matrix (scipy.sparse.csr_matrix). Even for long documents the matrix tends to contain lots of zeros, so it is usually at least an order of magnitude smaller than the dense size. If I am right, it should be well below 3 GB.
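You can verify the real footprint yourself; a sketch, assuming the fitted output is in a variable named tf_idf_matrix (a CSR matrix):

# Actual bytes held by a CSR matrix: the non-zero values plus the two index arrays.
sparse_bytes = (tf_idf_matrix.data.nbytes
                + tf_idf_matrix.indices.nbytes
                + tf_idf_matrix.indptr.nbytes)
density = tf_idf_matrix.nnz / (tf_idf_matrix.shape[0] * tf_idf_matrix.shape[1])
print(f"sparse matrix: {sparse_bytes / 1024**3:.2f} GiB, density: {density:.4f}")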
Hence the question: do you really have only 4000 words in your model (controlled by TfidfVectorizer(max_features=4000))?
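A quick sanity check on the fitted vectorizer (hypothetical variable name vectorizer):

# Number of features actually kept after fitting; should be 4000 if max_features=4000 took effect.
print(len(vectorizer.vocabulary_))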
If you don't care about individual word frequencies, you can decrease the vector size using PCA or other techniques.
from sklearn.decomposition import PCA

components_number = 300
dense_matrix = tf_idf_matrix.toarray()  # PCA needs a dense array; make sure you have enough RAM
reduced_data = PCA(n_components=components_number).fit_transform(dense_matrix)
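As an alternative not mentioned in the original answer, TruncatedSVD accepts the sparse matrix directly, which avoids materializing the dense copy at all; a sketch using the same variable names:

from sklearn.decomposition import TruncatedSVD

# TruncatedSVD (LSA when applied to TF-IDF) works on scipy sparse input,
# so no dense 100000 x 4000 array is ever created.
svd = TruncatedSVD(n_components=300, random_state=0)
reduced_data = svd.fit_transform(tf_idf_matrix)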
Or you can use something like doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html
Using it you'll get a matrix of shape (number_of_documents, embedding_size), where the embedding size is usually in the range of 100 to 600. You can train a doc2vec model without storing individual word vectors by leaving the dbow_words parameter at 0.
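A minimal sketch of that approach, assuming corpus is a hypothetical list of raw article strings and gensim 4.x:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each article with its index so its vector can be looked up later.
tagged = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(corpus)]

# dm=0 selects DBOW; dbow_words=0 skips word-vector training, keeping the model small.
model = Doc2Vec(tagged, vector_size=300, dm=0, dbow_words=0, epochs=10, workers=4)

first_doc_vector = model.dv[0]  # shape: (300,)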
If you care about individual word features, the only reasonable solution I see is to decrease the number of words.
Relevant stackoverflow posts:
----On dimensionality reduction
How do i visualize data points of tf-idf vectors for kmeans clustering?
----On using generators to train TFIDF
Sklearn TFIDF on large corpus of documents
How to get tf-idf matrix of a large size corpus, where features are pre-specified?
tf-idf on a somewhat large (65k) amount of text files
The model itself should not occupy so much space. I suppose that is possible only if you have some heavy objects in the TfidfVectorizer's tokenizer or preprocessor attributes.
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class Tokenizer:
    def __init__(self):
        # ~800 MB of float64 that gets pickled with the vectorizer via the bound method
        self.s = np.random.uniform(0, 1, size=(10000, 10000))

    def tokenizer(self, text):
        return text.lower().split()

tokenizer = Tokenizer()
vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenizer)
pickle.dump(vectorizer, open("vectorizer.pcl", "wb"))
This will occupy more than 700 MB after pickling.
Upvotes: 3