Reputation: 725
I'm trying to use the nodevectors package's Node2Vec class to get an embedding for my graph. I can't show the entire code, but basically this is what I'm doing:
import networkx as nx
import pandas as pd
import nodevectors

n2v = nodevectors.Node2Vec(n_components=128,
                           walklen=80,
                           epochs=3,
                           return_weight=1,
                           neighbor_weight=1,
                           threads=4)
# df (not shown) is an edge list with 'customer', 'item' and 'weight' columns
G = nx.from_pandas_edgelist(df, 'customer', 'item', edge_attr='weight', create_using=nx.Graph)
n2v.fit(G)
model = n2v.model
shape = model.wv.vectors.shape  # rows = number of embedded nodes
I know G has all the nodes in my scope. But after I fit the model, model.wv.vectors has fewer rows than G has nodes.
I can't figure out why the number of nodes represented in my embedding (the rows of model.wv.vectors) is lower than the actual number of nodes in G.
Does anyone know why this happens?
Upvotes: 0
Views: 389
Reputation: 54233
TL;DR: Your non-default epochs=3 can result in some nodes appearing only 3 times, but the inner Word2Vec model by default ignores tokens appearing fewer than 5 times. Upping to epochs=5 may be a quick fix, but read on for the reasons & tradeoffs with various defaults.
--
If you're using the nodevectors package described here, it seems to be built on Gensim's Word2Vec, which uses a default min_count=5.
That means any tokens (in this case, nodes) which appear fewer than 5 times are ignored. Especially in the natural-language contexts where Word2Vec was pioneered, discarding such rare words entirely usually has multiple benefits.
So typically in Word2Vec, discarding rare words only gives up low-value vectors while simultaneously speeding training, shrinking memory requirements, & improving the quality of the remaining vectors: a big win.
Although the distribution of node-names in graph random-walks may be very different from natural-language word-frequencies, some of the same concerns still apply for nodes that appear rarely. On the other hand, if a node truly only appears at the end of a long chain of nodes, every walk to or from it will include the exact same neighbors, and maybe extra appearances in more walks would add no new variety-of-information (at least within the inner Word2Vec window of analysis).
You may be able to confirm if the default min_count is your issue by using the Node2Vec keep_walks parameter to store the generated walks, then checking: are exactly the nodes that are 'missing' appearing fewer than min_count times in the walks?
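That check can be sketched roughly as below, in plain Python so it runs without nodevectors installed. It assumes the stored walks can be read as a list of node-name lists; the exact attribute holding them (for example n2v.walks) is an assumption to verify against your installed version. The toy walk generator also assumes, as this answer infers about epochs, that each epoch starts one walk from every node. The names simulate_walks, nodes_below_min_count and the star graph are hypothetical helpers made up for this illustration, not part of nodevectors or Gensim.

```python
import random
from collections import Counter


def simulate_walks(adjacency, walklen, epochs, seed=0):
    """Toy stand-in for walk generation: every epoch starts one
    uniform random walk of up to `walklen` nodes from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(epochs):
        for start in adjacency:
            walk = [start]
            while len(walk) < walklen:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


def nodes_below_min_count(walks, min_count=5):
    """Return {node: appearances} for nodes seen fewer than `min_count`
    times across all walks; a Word2Vec-style min_count would drop these."""
    counts = Counter(node for walk in walks for node in walk)
    return {node: n for node, n in counts.items() if n < min_count}


# Hypothetical toy graph: one hub connected to 200 leaves (undirected).
star = {"hub": [f"leaf{i}" for i in range(200)]}
star.update({leaf: ["hub"] for leaf in star["hub"]})

# With real data, replace these with the walks kept by keep_walks=True.
walks_3 = simulate_walks(star, walklen=4, epochs=3)
walks_5 = simulate_walks(star, walklen=4, epochs=5)

rare_3 = nodes_below_min_count(walks_3, min_count=5)
rare_5 = nodes_below_min_count(walks_5, min_count=5)

print("nodes below min_count with epochs=3:", len(rare_3))
print("nodes below min_count with epochs=5:", len(rare_5))
```

On real data, the interesting comparison is between the keys returned by nodes_below_min_count and the set of nodes in G that have no row in the embedding.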
If so, a few options may be:
- Override min_count, using the Node2Vec w2vparams option, to something like min_count=1. As noted above, this is always a bad idea in traditional natural-language Word2Vec, but maybe it's not so bad in a graph application, where for rare/outer-edge nodes one walk is enough, and then at least you have whatever strange/noisy vector results from that minimal training.
- Try to influence the walks so that every node appears often enough. Other values of the Node2Vec walklen, return_weight, & neighbor_weight could improve coverage, but I don't think they could guarantee all nodes appear in at least N (say, 5, to match the default min_count) different walks. But it looks like the Node2Vec epochs parameter controls how many times every node is used as a starting point, so epochs=5 would guarantee every node appears at least 5 times, as the start of 5 separate walks. (Notably: the Node2Vec default is epochs=20, which would never trigger a bad interaction with the internal Word2Vec min_count=5. But setting your non-default epochs=3 risks leaving some nodes with only 3 appearances.)
Upvotes: 1