Reputation: 31
I wanted to write the code to find the similarity between two sentences and then I ended up writing this code using nltk and gensim. I used tokenization and gensim.similarities.Similarity to do the work. But it ain't serving my purpose. It works fine until I introduce the last line of code.
import gensim
import nltk
raw_documents = ["I'm taking the show on the road.",
"My socks are a force multiplier.",
"I am the barber who cuts everyone's hair who doesn't cut their
own.",
"Legend has it that the mind is a mad monkey.",
"I make my own fun."]
from nltk.tokenize import word_tokenize
gen_docs = [[w.lower() for w in word_tokenize(text)]
for text in raw_documents]
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary[5])
print(dictionary.token2id['socks'])
print("Number of words in dictionary:",len(dictionary))
for i in range(len(dictionary)):
print(i, dictionary[i])
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(corpus)
tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)
s = 0
for i in corpus:
s += len(i)
print(s)
sims = gensim.similarities.Similarity('/usr/workdir/',tf_idf[corpus],
num_features=len(dictionary))
print(sims)
print(type(sims))
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)
sims[query_doc_tf_idf]
It throws this error. I couldn't find the answer for this anywhere on the internet.
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\gensim\utils.py", line 679, in save
_pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
TypeError: file must have a 'write' attribute
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "semantic.py", line 45, in <module>
sims[query_doc_tf_idf]
File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line
503, in __getitem__
self.close_shard() # no-op if no documents added to index since last
query
File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line
427, in close_shard
shard = Shard(self.shardid2filename(shardid), index)
File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line
110, in __init__
index.save(self.fullname())
File "C:\Python36\lib\site-packages\gensim\utils.py", line 682, in save
self._smart_save(fname_or_handle, separately, sep_limit, ignore,
pickle_protocol=pickle_protocol)
File "C:\Python36\lib\site-packages\gensim\utils.py", line 538, in
_smart_save
pickle(self, fname, protocol=pickle_protocol)
File "C:\Python36\lib\site-packages\gensim\utils.py", line 1337, in pickle
with smart_open(fname, 'wb') as fout: # 'b' for binary, needed on
Windows
File "C:\Python36\lib\site-packages\smart_open\smart_open_lib.py", line
181, in smart_open
fobj = _shortcut_open(uri, mode, **kw)
File "C:\Python36\lib\site-packages\smart_open\smart_open_lib.py", line
287, in _shortcut_open
return io.open(parsed_uri.uri_path, mode, **open_kwargs)
Please help figure out where the problem is.
Upvotes: 3
Views: 2381
Reputation: 4004
Your query should work if you specify a valid path when you instantiate your Similarity. For the example below, I have created a directory Similarity on my C-drive and have specified the directory path and a name for the file in the function call.
sims = gensim.similarities.Similarity('C:/Similarity/sims',tf_idf[corpus],
num_features=len(dictionary))
print(sims)
print(type(sims))
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)
print('Query result:', sims[query_doc_tf_idf])
Query result: [0. 0.84565616 0. 0.06124881 0. ]
Upvotes: 2