Reputation: 31
I've been using doc2vec in the most basic way so far, with limited success. I'm able to find similar documents, but I often get a lot of false positives. My primary goal is to build a classification algorithm for user requirements, to help with user-requirement analysis and search.
I know this is not really a large enough dataset so there are a few questions I'd like help with:
I've been calling train once, with 100-dimensional vectors, on 2000 documents of about 100 words each; each document has 22 columns, which are tagged by both cell and row.
import os

import gensim
import gensim.models.doc2vec


def tag_dataframe(df, selected_cols):
    tagged_cells = []
    headers = list(df.columns.values)
    for index, row in df.iterrows():
        row_tag = 'row_' + str(index)
        for col_name in headers:
            if col_name in selected_cols:
                col_tag = 'col_' + col_name  # currently unused below
                cell_tag = 'cell_' + str(index) + '_' + col_name
                cell_val = str(row[col_name])
                if cell_val == 'nan':
                    continue
                cleaned_text = clean_str(cell_val)
                if len(cleaned_text) == 0:
                    continue
                tagged_cells.append(
                    gensim.models.doc2vec.TaggedDocument(
                        cleaned_text,
                        [row_tag, cell_tag]))
    print('tagged rows')
    return tagged_cells
def load_or_build_vocab(model_path, tagged_cells):
    if os.path.exists(model_path):
        print('Loading vocab')
        d2vm = gensim.models.Doc2Vec.load(model_path)
    else:
        print('building vocab')
        d2vm = gensim.models.Doc2Vec(
            vector_size=100,
            min_count=0,
            alpha=0.025,
            min_alpha=0.001)
        d2vm.build_vocab(tagged_cells)
        print(' built')
        d2vm.save(model_path)
    return d2vm
def load_or_train_model(model_path, d2vm, tagged_cells):
    if os.path.exists(model_path):
        print('Loading Model')
        d2vm = gensim.models.Doc2Vec.load(model_path)
    else:
        print('Training Model')
        d2vm.train(
            tagged_cells,
            total_examples=len(tagged_cells),
            epochs=100)
        print(' trained')
        d2vm.save(model_path)
    return d2vm
What I hope to achieve is a set of document vectors which will help with finding similar user requirements from a free text and a Hierarchical Clustering to build navigation of the existing requirements.
Upvotes: 3
Views: 733
Reputation: 54153
You should look at the doc2vec Jupyter notebooks bundled with gensim in its docs/notebooks directory (or viewable online) for more examples of proper use. Looking through existing SO answers on the doc2vec tag (and perhaps especially my answers) may also give you an idea of common mistakes.
To tune the model in an unsupervised setting, you essentially need some domain-specific repeatable evaluation score. This might require going through your whole clustering & end-application, then counting its success on certain results it "should" give for a hand-created subset of your data.
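One minimal sketch of such a repeatable evaluation score, assuming you can hand-label some pairs of requirements that "should" be similar: for each labeled pair, check how often the pair's cosine similarity beats the similarity to a randomly chosen third document. (The function names and data layout here are illustrative, not from your code.)

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def pair_score(vectors, similar_pairs, trials=100, seed=0):
    """vectors: {doc_id: vector}; similar_pairs: hand-labeled (doc_id, doc_id) tuples.

    Returns the fraction of comparisons where a labeled pair is closer
    than a random third document -- higher is better, repeatable via seed.
    """
    rng = random.Random(seed)
    ids = list(vectors)
    wins = 0
    total = 0
    for a, b in similar_pairs:
        for _ in range(trials):
            c = rng.choice(ids)
            if c in (a, b):
                continue
            total += 1
            if cosine(vectors[a], vectors[b]) > cosine(vectors[a], vectors[c]):
                wins += 1
    return wins / total if total else 0.0
```

Re-running this fixed score after each parameter change (vector size, epochs, min_count) gives you a consistent way to compare models.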
For comparison, if you look at the original 'Paragraph Vector' paper, it used existing batches of top-10 search-results snippets from an existing search engine as the training documents, but then scored any model by how well it put snippets that were in a shared-top-10 closer to each other than to random 3rd documents. The followup paper 'Document Embedding with Paragraph Vectors' trained on Wikipedia articles or Arxiv papers, and tuned their model based on how well the resulting model put documents into the same pre-curated categories that exist on those systems.
You can use any clustering algorithm on the per-document vectors. The output of Doc2Vec, one vector per document, can become the input of downstream algorithms. (I'm not sure what you mean about "separate word and document classification models". You've only described document-level final needs, so you might not need word-vectors at all... though some Doc2Vec modes will create such vectors.)
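To suggest how the hierarchical-clustering-for-navigation idea could work, here is a toy bottom-up (agglomerative, single-linkage) clustering over a dict of document vectors. This naive O(n^3) version is only a sketch; for real data you'd feed the Doc2Vec per-document vectors into scipy or scikit-learn instead.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors):
    """vectors: {doc_id: vector}. Repeatedly merges the two closest
    clusters (single linkage) and returns the merge steps, which trace
    out a navigation hierarchy from leaves to root."""
    clusters = {doc_id: [doc_id] for doc_id in vectors}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                d = min(euclidean(vectors[x], vectors[y])
                        for x in clusters[keys[i]]
                        for y in clusters[keys[j]])
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        _, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```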
You use the infer_vector() method to create vectors for novel documents, after the model has been trained and frozen.
Looking at the specifics of your data/code, some observations:
- Published Doc2Vec work operates on tens-of-thousands to millions of documents. This algorithm works best with more data.
- While Doc2Vec supports giving documents multiple tags, as you've done here, it's best considered an advanced technique. It essentially dilutes what can be learned from a doc across the multiple tags, which could weaken the results, especially in small datasets.
- The alpha/min_alpha values supplied at model creation also serve as defaults for later infer_vector() operations (unless another value is explicitly passed there).
- Word2Vec and Doc2Vec often do better discarding rare words (with the default min_count=5, or larger when practical) than trying to train on them. Words that only appear one or a few times are often idiosyncratic in their usage, compared to the "true" importance of the word in the larger world. Keeping them makes models larger, slower to train, and more likely to reflect idiosyncrasies of the data than generalizable patterns.

Upvotes: 3