Swan87

Reputation: 421

Load extracted vectors to TfidfVectorizer

I am looking for a way to load vectors I generated previously using scikit-learn's TfidfVectorizer. In general what I wish is to get a better understanding of the TfidfVectorizer's data persistence.

For instance, what I did so far is:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stop)
vect_train = vectorizer.fit_transform(corpus)

Then I wrote 2 functions in order to be able to save and load my vectorizer:

import joblib


def save_model(model, name):
    '''
    Function that enables us to save a trained model

    '''
    joblib.dump(model, '{}.pkl'.format(name))


def load_model(name):
    '''
    Function that enables us to load a saved model

    '''
    return joblib.load('{}.pkl'.format(name))

I checked posts like the one below, but I still couldn't make much sense of it.

How do I store a TfidfVectorizer for future use in scikit-learn?

What I ultimately wish is to be able to have a training session, then load the produced set of vectors, transform newly arriving text input based on those vectors, and compute cosine_similarity between the old vectors and the new ones generated from them.
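To make this concrete, here is roughly the workflow I have in mind (the corpus and query strings are just placeholders; the commented-out persistence steps are the part I am asking about):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Training session (run once; on my real data this takes ~10 minutes):
corpus = ["apple banana cherry", "banana cherry durian"]
vectorizer = TfidfVectorizer(stop_words='english')
vect_train = vectorizer.fit_transform(corpus)
# ... somehow persist vectorizer and/or vect_train here ...

# Later, ideally in a different script:
# ... somehow load them back ...
new_query = ["banana durian"]
vect_new = vectorizer.transform(new_query)  # reuse the learned vocabulary/idfs
similarities = cosine_similarity(vect_new, vect_train)
```

The last call should give one similarity score per training document, without re-running the expensive fit step.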

One of the reasons that I wish to do this is because the vectorization in such a large dataset takes approximately 10 minutes and I wish to do this once and not every time a new query comes in.

I guess what I should be saving is vect_train, right? But then what is the correct way to first save it and later load it into a newly created instance of TfidfVectorizer?

The first time I tried to save vect_train with joblib, as the kind people at scikit-learn advise, I got 4 files: tfidf.pkl, tfidf.pkl_01.npy, tfidf.pkl_02.npy, tfidf.pkl_03.npy. It would be great if I knew what exactly those files are and how I could load them into a new instance of

vectorizer = TfidfVectorizer(stop_words=stop)

created in a different script.

Thank you in advance.

Upvotes: 2

Views: 2964

Answers (1)

geompalik

Reputation: 1582

The result of your vect_train = vectorizer.fit_transform(corpus) is twofold: (i) the vectorizer fits your data, that is, it learns the corpus vocabulary and the idf of each term, and (ii) vect_train holds the resulting vectors of your corpus.

The save_model and load_model functions you propose persist and load the vectorizer, that is, the internal parameters it has learned, such as the vocabulary and the idfs. Once the vectorizer is loaded, all you need to do to get vectors is call transform on a list of documents. These can be unseen documents, or the raw documents you used during fit_transform. Therefore, all you need is:

vectorizer = load_model(name)
vect_train = vectorizer.transform(corpus) # (1) or any unseen data

At this point, you have everything you had before saving, but the transformation call (1) will take some time depending on your corpus. In case you want to skip this, you need to also save the content of vect_train, as you correctly wonder in your question. This is a sparse matrix and can be saved/loaded using scipy, you can find information in this question for example. Copying from that question, to actually save the csr matrices you also need:

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

Concluding, the above functions can be used to save/load your vect_train, while the ones you provided save/load the vectorizer so that you can vectorize new data.
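Putting both pieces together, a minimal end-to-end sketch might look like the following (file names and the toy corpus are just examples; save_sparse_csr/load_sparse_csr are the functions above):

```python
import joblib
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

# --- training script: fit once, persist both objects ---
corpus = ["apple banana cherry", "banana cherry durian"]
vectorizer = TfidfVectorizer(stop_words='english')
vect_train = vectorizer.fit_transform(corpus)
joblib.dump(vectorizer, 'tfidf.pkl')           # the fitted vectorizer
save_sparse_csr('vect_train.npz', vect_train)  # the precomputed vectors

# --- query script: load, transform the new query, compare ---
vectorizer = joblib.load('tfidf.pkl')
vect_train = load_sparse_csr('vect_train.npz')
vect_new = vectorizer.transform(["banana durian"])  # no re-fitting needed
similarities = cosine_similarity(vect_new, vect_train)
```

This way the expensive fit_transform runs only once, and each incoming query pays only for a single transform plus the similarity computation.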

Upvotes: 4
