Word2vec on documents each one containing one sentence

Question

I have some unsupervised data (100.000 files) and each file has a paragraph containing one sentence. The preprocessing went wrong and deleted all stop points (.). I used word2vec on a small sample (2000 files) and it treated each document as one sentence. Should I continue the process on all remaining files? Or this would result to a bad model ?

Thank you

gojomo · Accepted Answer

Did you try it, and get bad results?

I'm not sure what you mean by "deleted all stop points". But, Gensim's Word2Vec is oblivious to what your tokens are, and doesn't really have any idea of 'sentences'.

All that matters is the lists-of-tokens you provide. (Sometimes people include puntuation like '.' as tokens, and sometimes it's stripped - and it doesn't make a very big different either way, and to the extent it does, whether it's good or bad may depend on your data & goals.)

Any lists-of-tokens that include neighboring related tokens, for the sort of context-window training that's central to the word2vec algorithm, should work well.

For example, it can't learn anything from one-word texts, where there are no neighboring words. But running togther sentences, paragraphs, and even full documents into long texts works fine.

Even concatenating wholly-unrelated texts doesn't hurt much: the bit of random noise from unrelated words now in-each-others' windows is outweighed, with enough training, by the meaningful relationships in the much-longer runs of truly-related text.

The main limit to consider is that each training text (list of tokens) shouldn't be more than 10,000 tokens long, as internal implementation limits up through Gensim 4.0 mean tokens past the 10,000th position will be ignored. (This limit might eventually be fixed - but until then, just splitting overlong texts into 10,000-token chunks is a fine workaround with negligible effects via the lost contexts at the break points.)

Word2vec on documents each one containing one sentence

Answers (1)

Related Questions