Reputation: 21302
(There is no tag for the R library rword2vec, so I have added tags for both r and word2vec. Tag recommendations are welcome.)
I am reading the documentation for rword2vec.
Example: Bin to txt:
To convert binary output file to text format:
###convert .bin to .txt
bin_to_txt("vec.bin","vector.txt")
Use this text file to get word vectors:
data=as.data.frame(read.table("vector.txt",skip=1))
str(data)
## 'data.frame': 71291 obs. of 101 variables:
## $ V1 : Factor w/ 71291 levels "a","aa","aaa",..: 55827 63881 45640 2646 45926 31473 1 64596 71091 44557 ...
## $ V2 : num 0.004 1.281 -0.577 -0.352 -0.361 ...
## $ V3 : num 0.00442 0.51466 -0.91757 -0.01408 0.04345 ...
## $ V4 : num -0.00383 0.36052 0.15737 0.18496 -0.04641 ...
## $ V5 : num -0.00328 0.0063 1.03664 0.94061 0.95325 ...
## $ V6 : num 0.00137 -0.29928 -0.78016 0.11719 0.46731 ...
## $ V7 : num 0.00302 0.36505 -0.60761 0.13251 1.0106 ...
## $ V8 : num 0.000941 -0.272078 1.016449 0.385708 -0.309844 ...
## $ V9 : num 0.000211 -0.27177 0.371277 -0.084057 -0.759528 ...
## $ V10 : num -0.0036 -0.8509 -0.5182 0.5113 -0.0053 ...
## $ V11 : num 0.00222 -0.38638 -0.60463 -0.18529 0.23022 ...
## $ V12 : num -0.00436 -0.13679 0.20418 0.3277 1.7405 ...
## $ V13 : num 0.00125 1.36504 -0.30284 -0.09633 -1.52368 ...
## $ V14 : num -0.000751 -0.954647 1.317677 0.357123 0.525351 ...
## and so on.
It takes a very long time to convert the txt file to a data frame. Other functions read directly from the bin file, e.g.
### file_name must be binary
dist=distance(file_name = "vec.bin",search_word = "terrible",num = 10)
dist
## word dist
## 1 sorrow 0.629752099514008
## 2 horrible 0.62950724363327
## 3 terrifying 0.627294421195984
## 4 dying 0.626088738441467
## 5 cruel 0.625054001808167
## 6 hunger 0.590250313282013
## 7 doomed 0.577929139137268
## 8 horrific 0.576288521289825
## 9 grief 0.572968125343323
## 10 cry 0.567858517169952
Is there a way to pull the word vector for a single input word, e.g. "terrible" as in the example above? The example shows the distance between "terrible" and its closest words. Instead, I want the vector itself, just for the word "terrible".
Upvotes: 0
Views: 214
Reputation: 101
How about loading an external pre-trained word2vec model? For example, in Python you can do:
import gensim
# load the pre-trained GoogleNews vectors from the compressed binary file
google_word2vec = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
google_word2vec.most_similar('terrible')
[('horrible', 0.92439204454422),
('horrendous', 0.8467271327972412),
('dreadful', 0.802276611328125),
('awful', 0.7478912472724915),
('horrid', 0.7179027795791626),
('atrocious', 0.6891814470291138),
('horrific', 0.6830835342407227),
('bad', 0.6828612089157104),
('appalling', 0.6752808690071106),
('horrible_horrible', 0.6672273874282837)]
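To get the raw vector rather than the nearest neighbours, a gensim KeyedVectors object can be indexed by word, e.g. google_word2vec['terrible']. If you would rather not load the whole model, below is a minimal sketch that scans a word2vec binary file and returns the vector for one word only. It assumes the file follows the layout written by the original word2vec C tool (a header line "vocab_size dimensions", then for each word: the word, a space, the components as little-endian 4-byte floats, and a newline); I have not checked that rword2vec's vec.bin matches this exactly. The function name read_word_vector and the tiny demo file are made up for illustration.

```python
import os
import struct
import tempfile


def read_word_vector(path, target):
    """Scan a word2vec-style binary file; return the vector for `target` or None.

    Assumed layout (original word2vec C tool): "<vocab> <dim>\n", then per word
    "<word> " + dim little-endian float32 values + "\n".
    """
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        vec_bytes = 4 * dim  # each component is a 4-byte float
        for _ in range(vocab_size):
            # Read the word byte by byte up to the separating space.
            chars = bytearray()
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline ending the previous record
                    chars.extend(ch)
            word = chars.decode("utf-8", errors="replace")
            if word == target:
                return list(struct.unpack("<%df" % dim, f.read(vec_bytes)))
            f.seek(vec_bytes, 1)  # not our word: skip its vector
    return None


# Demo on a hypothetical two-word file written in the assumed layout.
demo = {"terrible": [0.5, -1.0, 2.0], "good": [1.0, 0.0, -0.5]}
tmp = tempfile.NamedTemporaryFile(suffix=".bin", delete=False)
with tmp:
    tmp.write(b"2 3\n")
    for w, v in demo.items():
        tmp.write(w.encode("utf-8") + b" " + struct.pack("<3f", *v) + b"\n")

vector = read_word_vector(tmp.name, "good")
missing = read_word_vector(tmp.name, "absent")
os.remove(tmp.name)
print(vector)
```

With a real model you would call read_word_vector("vec.bin", "terrible"). The scan stops at the first match, so frequent words near the top of the file come back quickly, and nothing is ever built into a data frame.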
Upvotes: 1