Chedi Bechikh

Reputation: 173

Loading LSA sklearn vector

I trained an LSA model with sklearn and saved it with pickle:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
import numpy as np
import os.path
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import pickle


def load_data(path, file_name):
    """
    Input  : path and file_name
    Purpose: loading text file
    Output : list of paragraphs/documents and
             titles (initial 100 characters of each document used as its title)
    """
    documents_list = []
    titles = []
    with open(os.path.join(path, file_name), "r") as fin:
        for line in fin.readlines():
            text = line.strip()
            documents_list.append(text)
            # keep the first 100 characters of each document as its title
            titles.append(text[0:min(len(text), 100)])
    print("Total Number of Documents:", len(documents_list))
    return documents_list, titles

document_list, titles = load_data("", "a-choose")
#clean_text=preprocess_data(document_list)


# raw documents to tf-idf matrix: 

vectorizer = TfidfVectorizer(stop_words='english', 
                             use_idf=True, 
                             smooth_idf=True)

# SVD to reduce dimensionality: 

svd_model = TruncatedSVD(n_components=4,
                         algorithm='randomized',
                         n_iter=10)

# pipeline of tf-idf + SVD, fit to and applied to documents:

svd_transformer = Pipeline([('tfidf', vectorizer), 
                            ('svd', svd_model)])

svd_matrix = svd_transformer.fit_transform(document_list)

# svd_matrix can later be used to compare documents, compare words, or compare queries with documents (see the similarity sketch after this block)
# project two single-word example queries into the LSA space
sentence = ["football"]
sentence2 = ["match"]

query = svd_transformer.transform(sentence2)
query_vector = svd_transformer.transform(sentence)

#print(query_vector)
#print(query)


with open("lsa_model.bin","wb") as f:
    pickle.dump(svd_matrix, f)
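
For reference, the comparison mentioned in the comment above can be done with cosine similarity, since the query and the documents live in the same reduced LSA space. A minimal sketch using sklearn's cosine_similarity (this is only an illustration, not part of the saved script):

from sklearn.metrics.pairwise import cosine_similarity

# rank all documents against the query in the shared LSA space
similarities = cosine_similarity(query_vector, svd_matrix)  # shape (1, n_documents)
best_match = similarities.argmax()
print(document_list[best_match], similarities[0, best_match])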

As a second step, I use another program that loads this model in order to compare word vectors. The problem is that I am not able to load these vectors; my code is below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
import numpy as np
from gensim.models import KeyedVectors
import codecs
import pickle

model = pickle.load(open('lsa_model.bin', 'rb'))
query="best"

query_vector = model.transform(query)

print(query_vector)

This generates the following error:

query_vector = model.transform(query)
AttributeError: 'numpy.ndarray' object has no attribute 'transform'

Upvotes: -1

Views: 143

Answers (1)

Kafka

Reputation: 41

I think you need to use just fit here instead of fit_transform:

svd_matrix = svd_transformer.fit(document_list)

fit_transform returns the transformed matrix (a plain numpy array), and that array is what ends up in the pickle file, which is why the loaded object has no transform method. fit returns the fitted Pipeline itself, so after unpickling it you can still call transform. I am not sure why it works only in the second part.
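
A minimal sketch of the full round trip under that approach (reusing the variable names and file name from the question):

# save: fit returns the fitted Pipeline, and that is the object to pickle
svd_transformer.fit(document_list)
with open("lsa_model.bin", "wb") as f:
    pickle.dump(svd_transformer, f)

# load: the unpickled object is a Pipeline again, so transform() is available;
# note that transform expects a list of documents, not a bare string
with open("lsa_model.bin", "rb") as f:
    model = pickle.load(f)

query_vector = model.transform(["best"])
print(query_vector)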

Upvotes: 0
