Jiadong Chen

Reputation: 115

Gensim doc2vec: how are short docs processed?

In each tiny step of the doc2vec training process, the model takes a word and its neighbors within a certain distance (called the window size). The neighbors are summed up, averaged, concatenated, and so on.

My question is: what if the window extends past the boundary of a doc?

Then how are the neighbors summed up, averaged, or concatenated? Or are they simply discarded?

I am doing some NLP work and most docs in my dataset are quite short. I'd appreciate any ideas.

Upvotes: 1

Views: 73

Answers (1)

gojomo

Reputation: 54153

The pure PV-DBOW mode (dm=0), which trains quickly and often performs very well (especially on short documents), makes use of no sliding window at all. Each per-document vector is just trained to be good at directly predicting the document's words - neighboring words don't make any difference.
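For example, a minimal sketch of pure PV-DBOW training in gensim (the tiny corpus and parameter values here are just illustrative placeholders, not recommendations):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny illustrative corpus; real data would be your own tokenized short docs.
corpus = [
    TaggedDocument(words=["short", "doc", "one"], tags=[0]),
    TaggedDocument(words=["another", "short", "doc"], tags=[1]),
]

# dm=0 selects pure PV-DBOW: each doc-vector is trained to predict the
# document's words directly, so the window parameter has no effect here.
model = Doc2Vec(corpus, dm=0, vector_size=50, min_count=1, epochs=40)

# Infer a vector for a new short document.
vec = model.infer_vector(["yet", "another", "short", "doc"])
```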

Only when you either switch to PV-DM mode (dm=1), or add interleaved skip-gram word-vector training (dm=0, dbow_words=1) is the window relevant. And then, the window is handled the same as in Word2Vec training: if it would go past either end of the text, it's just truncated to not go over the end, perhaps leaving the effective window lop-sided.

So if you have a text "A B C D E", and a window of 2, when predicting the 1st word 'A', only the 'B' and 'C' to the right contribute (because there are zero words to the left). When predicting the 2nd word 'B', the 'A' to the left and the 'C' and 'D' to the right contribute. And so forth.
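As a plain illustration of that boundary-clipping (not gensim's actual code), the effective context for each position could be computed like this:

```python
def context_words(tokens, pos, window):
    # Clip the symmetric window at the text boundaries,
    # possibly leaving it lop-sided near the ends.
    start = max(0, pos - window)
    end = min(len(tokens), pos + window + 1)
    return [tokens[i] for i in range(start, end) if i != pos]

tokens = ["A", "B", "C", "D", "E"]
for pos, word in enumerate(tokens):
    print(word, "->", context_words(tokens, pos, 2))
# A -> ['B', 'C']
# B -> ['A', 'C', 'D']
# C -> ['A', 'B', 'D', 'E']
# D -> ['B', 'C', 'E']
# E -> ['C', 'D']
```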

An added wrinkle is that, to give nearby words a stronger weighting in a computationally efficient manner, the window used for any one target prediction is actually a random size from 1 up to the configured window value. So for window=2, half the time it's really only using a window of 1 on each side, and the other half the time using the full window of 2. (For window=5, it's using an effective value of 1 for 20% of the predictions, 2 for 20%, 3 for 20%, 4 for 20%, and 5 for 20%.) This effectively gives nearer words more influence, without the full computational cost of including all full-window words every time or any extra partial-weighting calculations.
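A rough sketch of that effective-window distribution (illustrative only, not gensim's internals):

```python
import random
from collections import Counter

window = 5
n = 100_000

# Draw an effective window uniformly from 1..window for each prediction.
samples = Counter(random.randint(1, window) for _ in range(n))

for size in sorted(samples):
    print(size, round(samples[size] / n, 3))  # each size appears ~20% of the time
```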

Upvotes: 3
