Reputation: 1069
I have a text file with one phrase per line. When I run word2vec on this file, it tokenizes the file into words and gives me a numerical vector for each word, like this:
the -0.464252 0.177642 -1.212928 0.737752 0.990782 1.530809 1.053639 0.182065 0.753926 0.082467
of -0.281145 0.060403 -0.877230 0.566957 0.748220 1.108621 0.711598 0.135636 0.489113 0.059783
to -0.352605 0.101068 -0.995506 0.600547 0.809564 1.360837 0.905638 0.114751 0.596093 0.067007
Instead, I want it to treat each line as a single word and output one vector per line, something like this:
Suspension of sitting -0.244289 0.111375 -0.722939 0.366711 0.590016 0.904601 0.622145 0.098230 0.431038 0.008134
This is the package I'm using: https://github.com/danielfrg/word2vec
How do I accomplish this?
Upvotes: 2
Views: 3290
Reputation: 55
Rama!
You can use doc2vec instead of word2vec.
Alternatively, you can compute a summary statistic over the word vectors in each phrase, e.g. the mean, median, min, or max of each vector component.
Here is one of the papers that describes this technique: https://arxiv.org/abs/1607.01759
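As a rough illustration of the summary-statistic idea, here is a minimal sketch that averages word vectors component by component. It assumes the vectors are available as plain text lines in the "word v1 v2 ..." layout shown in the question, with no header line. The helper names `load_vectors` and `phrase_vector` are made up for this example and are not part of the word2vec package, and the toy vectors are invented for the demonstration.

```python
def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def phrase_vector(phrase, vectors):
    """Return the component-wise mean of the vectors of the words in `phrase`.

    Words without an embedding are skipped (an assumption of this sketch).
    Returns None if no word of the phrase has a vector.
    """
    found = [vectors[w] for w in phrase.split() if w in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


# Toy 3-dimensional vectors, invented for illustration only.
toy = load_vectors(["of 1.0 2.0 3.0", "to 3.0 4.0 5.0"])
print(phrase_vector("of to", toy))
```

Swapping the mean for a median, min, or max only changes the expression inside the final list comprehension. Note that matching is case-sensitive here, so "Suspension" and "suspension" would be looked up as different words.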
Upvotes: 1
Reputation: 5301
Replace the spaces in your lines with underscores:
cat corpus.txt | tr " " "_" > corpus_underscored.txt
Now the embeddings will be created for whole phrases, for example:
Suspension_of_sitting SOMENUM SOMENUM SOMENUM ...
Note that I am not sure what your embedding is supposed to represent, though. word2vec will now simply embed each phrase within the window of phrases that come before and after it (just as it did with words). So if the phrases before and after your target phrase are not meaningful with respect to that target phrase, your numbers will not be meaningful either.
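If you would rather do the underscore replacement in Python than with tr, something along these lines should work. It is only a sketch: the function names `join_phrase` and `underscore_corpus` are invented for this example, and the file names are the same placeholders as in the shell command. Unlike the tr one-liner, it also collapses runs of whitespace into a single underscore, trims leading and trailing whitespace, and drops empty lines.

```python
def join_phrase(line):
    """Turn one phrase into a single token by joining its words with underscores."""
    return "_".join(line.split())


def underscore_corpus(src_path, dst_path):
    """Write one underscore-joined token per non-empty input line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            token = join_phrase(line)
            if token:  # skip lines that contain only whitespace
                dst.write(token + "\n")


print(join_phrase("Suspension of sitting"))
```

Calling `underscore_corpus("corpus.txt", "corpus_underscored.txt")` would then produce the file to feed to word2vec.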
Upvotes: 1