Reputation: 2342
Currently I am trying to perform text classification on a text corpus. To do so, I have decided to train word2vec
with the help of gensim
, using the code below:
sentences = MySentences("./corpus_samples") # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
MySentences is basically a class that handles the file I/O:
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()
Now we can get the vocabulary of the model that has been created:
print(model.wv.vocab)
A sample of the output is below:
t at 0x106f19438>, 'raining.': <gensim.models.keyedvectors.Vocab object at 0x106f19470>, 'fly': <gensim.models.keyedvectors.Vocab object at 0x106f194a8>, 'rain.': <gensim.models.keyedvectors.Vocab object at 0x106f194e0>, 'So…': <gensim.models.keyedvectors.Vocab object at 0x106f19518>, 'Ohhh,': <gensim.models.keyedvectors.Vocab object at 0x106f19550>, 'weird.': <gensim.models.keyedvectors.Vocab object at 0x106f19588>}
As of now, the vocabulary dictionary maps each word string to a <gensim.models.keyedvectors.Vocab object at 0x106f19588>
object or such. I want to be able to query the index of a particular word, so that I can format my training data like:
w91874 w2300 w6 w25363 w6332 w11 w767 w297441 w12480 w256 w23270 w13482 w22236 w259 w11 w26959 w25 w1613 w25363 w111 __label__4531492575592394249
w17314 w5521 w7729 w767 w10147 w111 __label__1315009618498473661
w305 w6651 w3974 w1005 w54 w109 w110 w3974 w29 w25 w1513 w3645 w6 w111 __label__-400525901828896492
w30877 w72 w11 w2828 w141417 w77033 w10147 w111 __label__4970306416006110305
w3332 w1107 w4809 w1009 w327 w84792 w6 w922 w11 w2182 w79887 w1099 w111 __label__-3645735357732416904
w471 w14752 w1637 w12348 w72 w31330 w930 w11569 w863 w25 w1439 w72 w111 __label__-5932391056759866388
w8081 w5324 w91048 w875 w13449 w1733 w111 __label__3812457715228923422
where wxxxx
represents the index of the word within the vocabulary and the label represents the class.
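For illustration, a minimal sketch of rendering tokenized, labeled sentences into that format (pure Python; the word2index mapping here is a hypothetical stand-in for whatever vocabulary index lookup ends up being used):

```python
# Hypothetical word -> index mapping; in practice this would come
# from the trained model's vocabulary.
word2index = {"hello": 91874, "world": 2300}

def encode_line(tokens, label, word2index):
    """Render tokens as w<index> entries followed by a __label__ tag,
    skipping tokens that are missing from the vocabulary."""
    indices = ["w%d" % word2index[t] for t in tokens if t in word2index]
    return " ".join(indices) + " __label__" + str(label)

print(encode_line(["hello", "world", "unseen"], 4531492575592394249, word2index))
# -> w91874 w2300 __label__4531492575592394249
```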
One of the solutions that I have been experimenting with is the corpora
utility of gensim
:
corpora = gensim.corpora.dictionary.Dictionary(sentences, prune_at=2000000)
print(corpora)
print(corpora.token2id['am'])  # look up the integer id of a token
This gives me a nice dictionary of the words, but this corpora vocabulary is not the same as the one created by the Word2Vec
model mentioned above.
Upvotes: 0
Views: 938
Reputation: 583
TL;DR:
model.wv.vocab['my_word'].index
where 'my_word'
is the word whose index you want (e.g. 'hello'
, 'the'
, etc.).
Long Story:
This is because gensim stores a Vocab
object for each word in the model.wv.vocab
dictionary.
That is the reason you get results like 'raining.': <gensim.models.keyedvectors.Vocab object at 0x106f19470>
when you try to print the dict.
The Vocab
object is initialized with the index like so:
wv.vocab[word] = Vocab(count=v, index=len(wv.index2word))
and thus allows access to this property.
I don't understand why you would need to represent it so, but this should do the trick.
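To make the lookup pattern explicit, here is a sketch using a tiny stand-in class instead of gensim (the class only mimics the count/index attributes that the real Vocab object carries):

```python
# Minimal stand-in mimicking gensim's Vocab object, which records
# the word's count and its position in the index2word list.
class Vocab(object):
    def __init__(self, count, index):
        self.count = count
        self.index = index

# wv.vocab maps each word string to a Vocab object, as in the
# printed dict from the question.
vocab = {
    "hello": Vocab(count=12, index=0),
    "raining.": Vocab(count=5, index=1),
}

# The index of a word is then just an attribute lookup:
print(vocab["hello"].index)     # -> 0
print(vocab["raining."].index)  # -> 1
```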
More details can be found in their source.
Upvotes: 1