Doug Fir

Reputation: 21302

Is there a way to get just the word vectors for a given word without converting the entire bin file to txt first?

(There is no tag for the R library rword2vec, so I have added tags for both R and word2vec. Tag recommendations welcome.)

Reading documentation on rword2vec here.

Example: Bin to txt:

To convert binary output file to text format:

###convert .bin to .txt
bin_to_txt("vec.bin","vector.txt")

Use this text file to get word vectors:

data=as.data.frame(read.table("vector.txt",skip=1))
data[1,]

## 'data.frame': 71291 obs. of  101 variables:
## $ V1  : Factor w/ 71291 levels "a","aa","aaa",..: 55827 63881 45640 2646 45926 31473 1 64596 71091 44557 ...
## $ V2  : num  0.004 1.281 -0.577 -0.352 -0.361 ...
## $ V3  : num  0.00442 0.51466 -0.91757 -0.01408 0.04345 ...
## $ V4  : num  -0.00383 0.36052 0.15737 0.18496 -0.04641 ...
## $ V5  : num  -0.00328 0.0063 1.03664 0.94061 0.95325 ...
## $ V6  : num  0.00137 -0.29928 -0.78016 0.11719 0.46731 ...
## $ V7  : num  0.00302 0.36505 -0.60761 0.13251 1.0106 ...
## $ V8  : num  0.000941 -0.272078 1.016449 0.385708 -0.309844 ...
## $ V9  : num  0.000211 -0.27177 0.371277 -0.084057 -0.759528 ...
## $ V10 : num  -0.0036 -0.8509 -0.5182 0.5113 -0.0053 ...
## $ V11 : num  0.00222 -0.38638 -0.60463 -0.18529 0.23022 ...
## $ V12 : num  -0.00436 -0.13679 0.20418 0.3277 1.7405 ...
## $ V13 : num  0.00125 1.36504 -0.30284 -0.09633 -1.52368 ...
## $ V14 : num  -0.000751 -0.954647 1.317677 0.357123 0.525351 ...

## and so on. 

It takes a very long time to convert the txt file to a data frame. Other functions in the package read directly from the bin file, e.g.

### file_name must be binary
dist=distance(file_name = "vec.bin",search_word = "terrible",num = 10)
dist

##          word              dist
## 1      sorrow 0.629752099514008
## 2    horrible  0.62950724363327
## 3  terrifying 0.627294421195984
## 4       dying 0.626088738441467
## 5       cruel 0.625054001808167
## 6      hunger 0.590250313282013
## 7      doomed 0.577929139137268
## 8    horrific 0.576288521289825
## 9       grief 0.572968125343323
## 10        cry 0.567858517169952

Is there a way to pull the word vector for just one given input, e.g. "terrible" as in the example above? That example shows the distances between "terrible" and its closest words. What I'm after instead is the vector itself, for the word "terrible" only.

Upvotes: 0

Views: 214

Answers (1)

Bilash Amantay

Reputation: 101

How about loading an external pre-trained word2vec model? For example, in Python with gensim you can:

import gensim

# Load the pre-trained Google News vectors straight from the binary file
google_word2vec = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz',binary=True)
google_word2vec.most_similar('terrible')

[('horrible', 0.92439204454422),
 ('horrendous', 0.8467271327972412),
 ('dreadful', 0.802276611328125),
 ('awful', 0.7478912472724915),
 ('horrid', 0.7179027795791626),
 ('atrocious', 0.6891814470291138),
 ('horrific', 0.6830835342407227),
 ('bad', 0.6828612089157104),
 ('appalling', 0.6752808690071106),
 ('horrible_horrible', 0.6672273874282837)]
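
With the model loaded as above, indexing it by word (google_word2vec['terrible']) returns the vector itself rather than its neighbours. If you would rather not load the whole model just to look up one word, here is a minimal sketch that scans the .bin file directly using only the standard library. It assumes the file follows the layout written by the original word2vec C tool (a text header "vocab_size dim", then for each word the word itself, a space, and dim little-endian 32-bit floats); I have not checked that rword2vec writes exactly this layout, so treat that as an assumption. The function name read_word_vector is made up for this example.

```python
import struct


def _read_word(f):
    """Read bytes up to the next space; return None at end of file."""
    chars = bytearray()
    while True:
        ch = f.read(1)
        if not ch:
            return None  # ran out of file before finding a word
        if ch == b" ":
            return bytes(chars)
        if ch != b"\n":  # entries may be separated by a newline; skip it
            chars += ch


def read_word_vector(path, target, encoding="utf-8"):
    """Return the vector for `target` as a list of floats, or None if absent.

    Assumes the word2vec C binary layout with little-endian float32 values.
    """
    wanted = target.encode(encoding)
    with open(path, "rb") as f:
        # Header line: "<vocab_size> <dim>\n"
        vocab_size, dim = map(int, f.readline().split())
        nbytes = 4 * dim  # each component is a 4-byte float
        for _ in range(vocab_size):
            word = _read_word(f)
            if word is None:
                break
            if word == wanted:
                return list(struct.unpack("<%df" % dim, f.read(nbytes)))
            f.seek(nbytes, 1)  # skip this word's vector without decoding it
    return None
```

Calling read_word_vector("vec.bin", "terrible") should then give a list of dim floats, or None if the word is not in the vocabulary. It is still a linear scan of the file, but it only decodes the one vector you asked for and never builds a text file or a data frame.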

Upvotes: 1
