Yehoshaphat Schellekens

Reputation: 2385

Extract embedded vector per word from h2o.word2vec object

I'm trying to create a pre-trained embedding layer using h2o.word2vec. I'd like to extract each word in the model together with its embedding vector.

Code:

library(data.table)
library(h2o)
h2o.init(nthreads = -1)

comment <- data.table(comments='ExplanationWhy the edits made under my username Hardcore Metallica 
                      Fan were reverted They werent vandalisms just closure on some GAs after I voted 
                      at New York Dolls FAC And please dont remove the template from the talk page since Im retired now')

comments.hex <- as.h2o(comment, destination_frame = "comments.hex", col.types=c("String"))

words <- h2o.tokenize(comments.hex$comments, "\\\\W+")

vectors <- 3 # Embedding dimension; kept at 3 to save time & memory
w2v.model <- h2o.word2vec(words
                          , model_id = "w2v_model"
                          , vec_size = vectors
                          , min_word_freq = 1
                          , window_size = 2
                          , init_learning_rate = 0.025
                          , sent_sample_rate = 0
                          , epochs = 1) # only one epoch to save time
print(h2o.findSynonyms(w2v.model, "the",2))

The h2o API lets me find synonyms by cosine similarity between words, but I just want the vector for each word in my vocabulary. How can I get it? I couldn't find any simple method in the API that returns it.

Thanks in advance

Upvotes: 1

Views: 328

Answers (1)

Lauren

Reputation: 5778

You can use the model's transform method: w2v_model.transform(words=words)

(The complete signature is w2v_model.transform(words =, aggregate_method =).) This is the Python-client syntax; in the R client the corresponding call should be h2o.transform(w2v.model, words, aggregate_method = "NONE").

Here words is an H2O Frame made of a single column containing the source words (you can also pass just a subset of this frame), and aggregate_method specifies how to aggregate sequences of words.

If you don't specify an aggregation method, no aggregation is performed and each input word is mapped to a single word vector. If the method is AVERAGE, the input is treated as sequences of words delimited by NA, and each sequence is mapped to the average of its word vectors.

For example:

av_vecs = w2v_model.transform(words, aggregate_method = "AVERAGE")
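
To make the difference between the two modes concrete, here is a minimal pure-Python sketch of the behaviour described in this answer. It does not call H2O at all: the transform helper, the toy_embeddings dictionary and the tokens list are hypothetical stand-ins (None plays the role of NA), and leaving None as None in the unaggregated mode is a simplifying assumption of this sketch.

```python
from typing import Dict, List, Optional

Vector = List[float]


def transform(words: List[Optional[str]],
              embeddings: Dict[str, Vector],
              aggregate_method: str = "NONE") -> List[Optional[Vector]]:
    """Toy illustration of word-vector lookup with optional averaging.

    Hypothetical helper, not part of the h2o package.
    """
    if aggregate_method == "NONE":
        # No aggregation: one output row per input row.
        # Assumption of this sketch: a None (NA) input stays None.
        return [None if w is None else embeddings[w] for w in words]

    if aggregate_method != "AVERAGE":
        raise ValueError("aggregate_method must be 'NONE' or 'AVERAGE'")

    averaged: List[Optional[Vector]] = []
    current: List[Vector] = []
    # A trailing None guarantees the last sequence is flushed.
    for w in list(words) + [None]:
        if w is not None:
            current.append(embeddings[w])
        elif current:
            n = len(current)
            # Component-wise mean of all vectors in the sequence.
            averaged.append([sum(col) / n for col in zip(*current)])
            current = []
    return averaged


# Toy data: two "sentences" separated by None (standing in for NA).
toy_embeddings = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]}
tokens = ["a", "b", None, "c", None]

per_word = transform(tokens, toy_embeddings)                  # one vector per token
per_sentence = transform(tokens, toy_embeddings, "AVERAGE")   # one vector per sequence
```

With the default mode you get one row per input word, which is what the question asks for; passing the distinct words of the vocabulary as input would give exactly one vector per word.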

Upvotes: 2
