Ryszard Tuora
Ryszard Tuora

Reputation: 1

How can I recover the likelihood of a certain word appearing in a given context from word embeddings?

I know that some methods of generating word embeddings (e.g. CBOW) are based on predicting the likelihood of a given word appearing in a given context. I'm working with polish language, which is sometimes ambiguous with respect to segmentation, e.g. 'Coś' can be either treated as one word, or two words which have been conjoined ('Co' + '-ś') depending on the context. What I want to do, is create a tokenizer which is context sensitive. Assuming that I have the vector representation of the preceding context, and all possible segmentations, could I somehow calculate, or approximate the likelihood of particular words appearing in this context?

Upvotes: 0

Views: 185

Answers (1)

Jindřich
Jindřich

Reputation: 11240

This very much depends on the way how you got your embeddings. The CBOW model has two parameters the embedding matrix that is denoted v and the output projection matrix v'. If you want to recover the probabilities that are used in the CBOW model at training time, you need to get v' as well. See equation (2) in the word2vec paper. Tools for pre-computing word embeddings usually don't do that, so you would need to modify them yourself.

Anyway, if you want to compute a probability of a word, given a context, you should rather think about using a (neural) language model than a table of word embeddings. If you search the Internet, I am sure you will find something that suits your needs.

Upvotes: 0

Related Questions