Bitswazsky

Reputation: 4698

similarity score is way off using doc2vec embedding

I'm trying out document de-duplication on an NY-Times corpus that I've prepared very recently. It contains data related to financial fraud.

First, I convert the article snippets to a list of TaggedDocument objects.

import spacy
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

nlp = spacy.load("en_core_web_sm")

def create_tagged_doc(doc, nlp):
    # Lemmatize and drop stop words; returns a plain list of tokens
    toks = nlp(doc)
    lemmatized_toks = [tok.lemma_ for tok in toks if not tok.is_stop]
    return lemmatized_toks

df_fraud = pd.read_csv('...local_path...')
df_fraud_list = df_fraud['snippet'].to_list()
documents = [TaggedDocument(create_tagged_doc(doc, nlp), [i]) for i, doc in enumerate(df_fraud_list)]

A sample TaggedDocument looks as follows:

TaggedDocument(words=['Chicago', 'woman', 'fall', 'mortgage', 'payment', 
'victim', 'common', 'fraud', 'know', 'equity', 'strip', '.'], tags=[1])

Now I compile and train the Doc2Vec model.

import multiprocessing

cores = multiprocessing.cpu_count()
model_dbow = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0, workers=cores)
model_dbow.build_vocab(documents)
model_dbow.train(documents,
                 total_examples=model_dbow.corpus_count,
                 epochs=model_dbow.epochs)

Let's define the cosine similarity:

import numpy as np
from numpy.linalg import norm

cosine_sim = lambda x, y: np.inner(x, y) / (norm(x) * norm(y))

Now, the trouble is, if I take two sentences that are nearly identical and compute their cosine similarity, the score comes out very low. E.g.

a = model_dbow.infer_vector(create_tagged_doc('That was a fradulent transaction.', nlp))
b = model_dbow.infer_vector(create_tagged_doc('That transaction was fradulant.', nlp))

print(cosine_sim(a, b)) # 0.07102317

Just to make sure, I checked by inferring the exact same sentence twice, and that score looks proper.

a = model_dbow.infer_vector(create_tagged_doc('That was a fradulent transaction.', nlp))
b = model_dbow.infer_vector(create_tagged_doc('That was a fradulent transaction.', nlp))

print(cosine_sim(a, b)) # 0.9980062

What's going wrong here?

Upvotes: 1

Views: 497

Answers (2)

gojomo

Reputation: 54243

Let's look at the actual tokens you're passing to infer_vector():

In [4]: create_tagged_doc('That was a fradulent transaction.', nlp)                                                           
Out[4]: ['fradulent', 'transaction', '.']

In [5]: create_tagged_doc('That transaction was fradulant.', nlp)                                                             
Out[5]: ['transaction', 'fradulant', '.']

The misspelling 'fradulant' is probably not in your NYT corpus, and is thus likely unknown to the Doc2Vec model and simply ignored during inference. So you're really calculating doc-vectors for:

['fradulent', 'transaction', '.'] vs ['transaction', '.']
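
You can verify this by checking the model's trained vocabulary directly. A quick sketch, assuming gensim 4.x's wv.key_to_index mapping (older 3.x versions expose wv.vocab instead):

for tok in ['fradulent', 'fradulant', 'transaction', '.']:
    # Tokens never seen in training, or below min_count, won't be in the vocab
    print(tok, tok in model_dbow.wv.key_to_index)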

Further, '.' probably isn't very significant, especially if it appeared in nearly all training examples. Also note that tiny examples (of one to a few words) don't give inference much to work with: they're stark utterances, perhaps unlike the bulk of the training data, so their inferred vectors have few counterbalancing influences compared to longer texts.
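
One way to see how unstable inference is on such tiny texts is to infer the same snippet several times and compare the runs. A rough sketch, using infer_vector's optional epochs parameter (available in recent gensim versions):

words = create_tagged_doc('That was a fradulent transaction.', nlp)
# Repeat inference on the same tiny text to gauge its stability
vecs = [model_dbow.infer_vector(words, epochs=50) for _ in range(5)]
print([cosine_sim(vecs[0], v) for v in vecs[1:]])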

For example, in a Doc2Vec mode where word-vectors and doc-vectors are co-trained into the same comparable space, like PV-DM (dm=1), I'm not sure whether, for a single-word document like ['transaction'], the more useful vector would be the one inferred from that token list or simply the word-vector for 'transaction'.
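
For illustration, with a hypothetical PV-DM model called model_dm (trained with dm=1 on the same corpus), you could compare the two candidates yourself:

# Sketch only: model_dm is a hypothetical dm=1 model, not one trained above
inferred = model_dm.infer_vector(['transaction'])
word_vec = model_dm.wv['transaction']
print(cosine_sim(inferred, word_vec))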

And finally, since similarity ranges from -1.0 to 1.0, 0.07 maybe isn't that bad for what is effectively a comparison between ['fradulent', 'transaction', '.'] and ['transaction', '.'].

Upvotes: 1

Bitswazsky

Reputation: 4698

Looks like it was an issue with the number of epochs. When creating a Doc2Vec instance without specifying the number of epochs, e.g. model_dbow = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0, workers=cores), it defaults to 5. Apparently that wasn't sufficient for my corpus. I set the epochs to 50, re-trained the model, and voila! It worked.
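
For reference, the retraining looked roughly like this, passing epochs explicitly when constructing the model:

model_dbow = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2,
                     sample=0, workers=cores, epochs=50)
model_dbow.build_vocab(documents)
model_dbow.train(documents,
                 total_examples=model_dbow.corpus_count,
                 epochs=model_dbow.epochs)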

Upvotes: 1
