Rama

Reputation: 1069

Using word2vec on phrases

I have a text file with one phrase per line. When I run word2vec on this file, it tokenizes the text into words and gives me a numerical vector for each word, like this:

the -0.464252 0.177642 -1.212928 0.737752 0.990782 1.530809 1.053639 0.182065 0.753926 0.082467
of -0.281145 0.060403 -0.877230 0.566957 0.748220 1.108621 0.711598 0.135636 0.489113 0.059783
to -0.352605 0.101068 -0.995506 0.600547 0.809564 1.360837 0.905638 0.114751 0.596093 0.067007

Instead, I want it to treat each line as a single word and output one vector per line, something like this:

Suspension of sitting -0.244289 0.111375 -0.722939 0.366711 0.590016 0.904601 0.622145 0.098230 0.431038 0.008134

This is the package I'm using: https://github.com/danielfrg/word2vec

How do I accomplish this?

Upvotes: 2

Views: 3290

Answers (2)

ivanicki.ilia

Reputation: 55

Rama!

You could use doc2vec instead of word2vec.

Or you can compute a summary statistic over the vectors of all the words in the phrase, e.g. the mean, median, min, or max of each vector component.

This is one of the papers that describes this technique: https://arxiv.org/abs/1607.01759
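The summary-statistic idea can be sketched in a few lines of plain Python. This is only an illustration, not the package's API: the helper names `load_vectors` and `phrase_vector` are made up, the three toy vectors are invented, and it assumes the vocabulary is lowercased and the input has no header line.

```python
from statistics import mean

def load_vectors(lines):
    """Parse text-format lines: a token followed by its float components.

    Assumption: no "vocab_size dimensions" header line is present.
    """
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def phrase_vector(phrase, vectors):
    """Average the vectors of the known words, component by component.

    Returns None when none of the words is in the vocabulary.
    """
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        return None
    return [mean(component) for component in zip(*known)]

# Toy 3-dimensional vectors, invented for illustration only.
toy_lines = [
    "suspension 1.0 2.0 3.0",
    "of 0.0 0.0 0.0",
    "sitting 2.0 4.0 6.0",
]
vectors = load_vectors(toy_lines)
result = phrase_vector("Suspension of sitting", vectors)
print(result)
```

Swapping `mean` for `statistics.median`, `min`, or `max` gives the other statistics mentioned above.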

Upvotes: 1

fnl

Reputation: 5301

Replace the spaces in each line with underscores: cat corpus.txt | tr " " "_" > corpus_underscored.txt

Now an embedding will be created for each whole phrase, for example: Suspension_of_sitting SOMENUM SOMENUM SOMENUM ...
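If you would rather do the replacement in Python, here is a rough equivalent. The function names are hypothetical, and unlike the tr command this variant also collapses repeated spaces and drops leading or trailing whitespace, so a line never ends in a stray underscore.

```python
def join_phrase(line):
    """Turn one line into a single token by joining its words with underscores."""
    return "_".join(line.split())

def underscore_corpus(src_path, dst_path):
    """Write an underscored copy of the corpus, one phrase token per line."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            token = join_phrase(line)
            if token:  # skip empty lines
                dst.write(token + "\n")

example = join_phrase("Suspension of sitting\n")
print(example)
```

With the file names from the command above, the call would be underscore_corpus("corpus.txt", "corpus_underscored.txt").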

Note that I am not sure what your embedding is supposed to represent, though. word2vec will now simply embed each phrase using the window of phrases that come before and after it (just as it did with words). So if the phrases before and after your target phrase are not meaningful with respect to that target phrase, your numbers will not be meaningful either.
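To make the window idea concrete, here is a simplified sketch of how (target, context) pairs would be drawn from a sequence of phrase tokens. It is not the package's code: `context_pairs` is a made-up helper, it uses a fixed window where real implementations typically vary it, and whether line breaks act as sentence boundaries depends on the implementation.

```python
def context_pairs(tokens, window=2):
    """List (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical phrase tokens, in the order they appear in the file.
phrases = ["Suspension_of_sitting", "Order_of_business", "Voting_time"]
pairs = context_pairs(phrases, window=1)
for target, context in pairs:
    print(target, "->", context)
```

Each phrase is trained only against its neighbouring lines, so the order of lines in the file determines what the vectors capture.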

Upvotes: 1
