Reputation: 1920
I have roughly 100,000 long articles, about 5 GB of text in total. When I run TfidfVectorizer from sklearn on them, it constructs a model of about 6 GB. How is that possible? Don't we only need to store the document frequency of those 4000 words and what those 4000 words are? My guess is that TfidfVectorizer stores such a 4000-dimensional vector for every document. Is it possible that I have some setting configured incorrectly?
Upvotes: 1
Views: 2773
Reputation: 626
I know there is an accepted answer already, but here is some additional information for others to consider. When you pickle the TfidfVectorizer directly, you also save its stop_words_ attribute, which is not needed once the vocabulary has been established. In one of our models there were only 3000 words in the vocabulary, yet the saved model occupied 250 MB; inspecting it, we found that 10 million stop words were stored along with the model. Then we noticed the following warning in the TfidfVectorizer documentation:
"The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling."
Applying that reduced our model size significantly.
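Concretely, a minimal sketch of that clean-up (assuming your fitted vectorizer lives in a variable named vectorizer):

import pickle

# Drop the introspection-only stop_words_ attribute before serializing;
# the fitted vocabulary_ and idf_ are what transform() actually needs.
vectorizer.stop_words_ = None  # or: delattr(vectorizer, "stop_words_")
with open("tfidf.pkl", "wb") as f:
    pickle.dump(vectorizer, f)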
Upvotes: 2
Reputation: 464
A TF-IDF matrix has shape (number_of_documents, number_of_unique_words), so for each document you get a feature for every word in the dataset. This can get bloated for large datasets.
In your case, a dense matrix would take roughly
100000 (docs) * 4000 (words) * 8 (np.float64 bytes) / 1024**3 ~ 3 GB
Moreover, sklearn's TfidfVectorizer compensates for this by default by returning a sparse matrix (scipy.sparse.csr_matrix). Even for long documents the matrix tends to contain lots of zeros, so it is usually at least an order of magnitude smaller than the dense size. If I am right, it should be well below 3 GB.
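You can verify the real footprint yourself; a sketch, assuming the fitted output is in a variable named tf_idf_matrix (a CSR matrix):

# Actual bytes held by a CSR matrix: the non-zero values plus the two index arrays.
sparse_bytes = (tf_idf_matrix.data.nbytes
                + tf_idf_matrix.indices.nbytes
                + tf_idf_matrix.indptr.nbytes)
density = tf_idf_matrix.nnz / (tf_idf_matrix.shape[0] * tf_idf_matrix.shape[1])
print(f"sparse matrix: {sparse_bytes / 1024**3:.2f} GiB, density: {density:.4f}")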
Hence the question: do you really have only 4000 words in your model (controlled by TfidfVectorizer(max_features=4000))?
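A quick sanity check on the fitted vectorizer (hypothetical variable name vectorizer):

# Number of features actually kept after fitting; should be 4000 if max_features=4000 took effect.
print(len(vectorizer.vocabulary_))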
If you don't care about individual word frequencies, you can decrease the vector size using PCA or other techniques.
from sklearn.decomposition import PCA

components_number = 300
dense_matrix = tf_idf_matrix.toarray()  # PCA needs a dense array; make sure you have enough RAM
reduced_data = PCA(n_components=components_number).fit_transform(dense_matrix)
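As an alternative not mentioned in the original answer, TruncatedSVD accepts the sparse matrix directly, which avoids materializing the dense copy at all; a sketch using the same variable names:

from sklearn.decomposition import TruncatedSVD

# TruncatedSVD (LSA when applied to TF-IDF) works on scipy sparse input,
# so no dense 100000 x 4000 array is ever created.
svd = TruncatedSVD(n_components=300, random_state=0)
reduced_data = svd.fit_transform(tf_idf_matrix)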
Or you can use something like doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html
Using it you'll get a matrix of shape (number_of_documents, embedding_size), where the embedding size is usually in the range of 100 to 600. You can train a doc2vec model without storing individual word vectors by leaving the dbow_words parameter at 0.
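A minimal sketch of that approach, assuming corpus is a hypothetical list of raw article strings and gensim 4.x:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each article with its index so its vector can be looked up later.
tagged = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(corpus)]

# dm=0 selects DBOW; dbow_words=0 skips word-vector training, keeping the model small.
model = Doc2Vec(tagged, vector_size=300, dm=0, dbow_words=0, epochs=10, workers=4)

first_doc_vector = model.dv[0]  # shape: (300,)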
If you care about individual word features, the only reasonable solution I see is to decrease the number of words.
Relevant stackoverflow posts:
----On dimensionality reduction
How do i visualize data points of tf-idf vectors for kmeans clustering?
----On using generators to train TFIDF
Sklearn TFIDF on large corpus of documents
How to get tf-idf matrix of a large size corpus, where features are pre-specified?
tf-idf on a somewhat large (65k) amount of text files
The model itself should not occupy so much space. I suppose that is possible only if you have some heavy objects in the TfidfVectorizer's tokenizer or preprocessor attributes.
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class Tokenizer:
    def __init__(self):
        # ~800 MB of float64 that gets pickled with the vectorizer via the bound method
        self.s = np.random.uniform(0, 1, size=(10000, 10000))

    def tokenizer(self, text):
        return text.lower().split()

tokenizer = Tokenizer()
vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenizer)
pickle.dump(vectorizer, open("vectorizer.pcl", "wb"))
This will occupy more than 700 MB after pickling.
Upvotes: 3