Reputation: 406
Lately I have been doing research aimed at unsupervised clustering of a huge text database. First I tried bag-of-words and then several clustering algorithms, which gave me a good result, but now I am trying to move to a doc2vec representation and it does not seem to work for me: I cannot load a prepared model and work with it, and training my own does not produce any useful result.
I tried to train my model on 10k texts
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=2, epochs=100, workers=8)
(around 20-50 words each), but the similarity score provided by gensim, e.g.
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
is working much worse than the same comparison for bag-of-words with my model. By much worse I mean that identical or almost identical texts have a similarity score comparable to texts that have no connection I can think of. So I decided to follow Is there pre-trained doc2vec model? and use some pretrained model which might have more connections between words. Sorry for the somewhat long preamble, but the question is: how do I plug it in? Can someone provide some ideas on how, using the gensim model loaded from https://github.com/jhlau/doc2vec, I can convert my own dataset of texts into vectors of the same length? My data is preprocessed (stemmed, no punctuation, lowercase, no nltk.corpus stopwords) and I can deliver it from a list, a dataframe or a file if needed; the code question is how to pass my own data to a pretrained model. Any help would be appreciated.
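To make it concrete, what I am hoping to be able to do is roughly the following (just a sketch of the intended usage, not working code; the model path is a placeholder and texts stands for my preprocessed documents):

from gensim.models.doc2vec import Doc2Vec

# placeholder path to a pretrained model, e.g. one of the jhlau/doc2vec downloads
model = Doc2Vec.load("pretrained_doc2vec.model")

# texts: my own preprocessed documents, each a list of tokens
doc_vectors = [model.infer_vector(tokens) for tokens in texts]

doc_vectors would then be fixed-length vectors I could feed into my clustering step.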
UPD: some outputs that illustrate the problem
Train Document (6134): «use medium paper examination medium habit one week must chart daily use medium radio television newspaper magazine film video etc wake radio alarm listen traffic report commuting get news watch sport soap opera watch tv use internet work home read book see movie use data collect journal basis analysis examining information using us gratification model discussed textbook us gratification article provided perhaps carrying small notebook day inputting material evening help stay organized smartphone use note app track medium need turn diary trust tell tell immediately paper whether actually kept one begin medium diary soon possible order give ample time complete journal write paper completed diary need write page paper use medium functional analysis theory say something best understood understanding used us gratification model provides framework individual use medium basis analysis especially category discussed posted dominick article apply concept medium usage expected le medium use cognitive social utility affiliation withdrawal must draw conclusion use analyzing habit within framework idea discussed text article concept must clearly included articulated paper common mistake student make assignment tell medium habit fail analyze habit within context us gratification model must include idea paper»
Similar Document (6130, 0.6926988363265991): «use medium paper examination medium habit one week must chart daily use medium radio television newspaper magazine film video etc wake radio alarm listen traffic report commuting get news watch sport soap opera watch tv use internet work home read book see movie use data collect journal basis analysis examining information using us gratification model discussed textbook us gratification article provided perhaps carrying small notebook day inputting material evening help stay organized smartphone use note app track medium need turn diary trust tell tell immediately paper whether actually kept one begin medium diary soon possible order give ample time complete journal write paper completed diary need write page paper use medium functional analysis theory say something best understood understanding used us gratification model provides framework individual use medium basis analysis especially category discussed posted dominick article apply concept medium usage expected le medium use cognitive social utility affiliation withdrawal must draw conclusion use analyzing habit within framework idea discussed text article concept must clearly included articulated paper common mistake student make assignment tell medium habit fail analyze habit within context us gratification model must include idea paper»
This looks perfectly OK, but looking at other outputs
Train Document (1185): «photography garry winogrand would like paper life work garry winogrand famous street photographer also influenced street photography aim towards thoughtful imaginative treatment detail referencescite research material academic essay university level»
Similar Document (3449, 0.6901006698608398): «tang dynasty write page essay tang dynasty essay discus buddhism tang dynasty name artifact tang dynasty discus them history put heading paragraph information tang dynasty discussed essay»
shows that the similarity score between two texts that are exactly the same (the most similar pair in the system) and two completely unrelated ones is almost the same, which makes it problematic to do anything with the data. To get the most similar documents I use
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
Upvotes: 1
Views: 1046
Reputation: 54243
The models from https://github.com/jhlau/doc2vec are based on a custom fork of an older version of gensim, so you'd have to find/use that to make them usable.
Models from a generic dataset (like Wikipedia) may not understand the domain-specific words you need, and even where words are shared, the effective senses of those words may vary. Also, to use another model to infer vectors on your data, you should ensure you're preprocessing/tokenizing your text in the same way as the training data was processed.
Thus, it's best to use a model you've trained yourself – so you fully understand it – on domain-relevant data.
10k documents of 20-50 words each is a bit small compared to published Doc2Vec work, but might work. Trying to get 500-dimensional vectors from a smaller dataset could be a problem. (With less data, fewer vector dimensions and more training iterations may be necessary.)
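For example, a minimal sketch of that kind of setup (assuming your preprocessed documents are already available as a Python list of token lists called texts; the specific parameter values are only illustrative and worth tuning):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# wrap each preprocessed document (a list of tokens) with a unique tag
corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(texts)]

# fewer dimensions and more passes than the 500-d / 100-epoch setup in the question
model = Doc2Vec(vector_size=100, min_count=2, epochs=200, workers=8)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# sanity check: re-inferring a training document should usually rank that same
# document (or a near-duplicate) at or near the top
inferred_vector = model.infer_vector(corpus[0].words)
sims = model.docvecs.most_similar([inferred_vector], topn=10)
print(sims[:3])

If a sanity check like that fails badly, it's usually a sign of a problem in the training or inference code rather than in the data itself.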
If your results with your self-trained model are unsatisfactory, there could be other problems in your training and inference code (that's not shown yet in your question). It would also help to see more concrete examples/details of how your results are unsatisfactory, compared to a baseline (like the bag-of-words representations you mention). If you add these details to your question, it might be possible to offer other suggestions.
Upvotes: 2