Rafael Higa
Rafael Higa

Reputation: 725

nodevectors not returning all nodes

I'm trying to use nodevector's Node2Vec class to get an embedding for my graph. I can't show the entire code, but basically this is what I'm doing:

import networkx as nx
import pandas as pd
import nodevectors

n2v = nodevectors.Node2Vec(n_components=128,
                           walklen=80,
                           epochs=3,
                           return_weight=1,
                           neighbor_weight=1,
                           threads=4)
G = nx.from_pandas_edgelist(df, 'customer', 'item', edge_attr='weight', create_using=nx.Graph)
n2v.fit(G)
model = n2v.model
shape = model.ww.vectors.shape

I know G has all the nodes from my scope. Then, I fit the model, but model.ww.vectors has a number of rows smaller than my number of nodes.

I'm not successfully finding why do the number of nodes represented in my embedding by model.ww.vectors is lower than my actual number of nodes in G.

Does anyone know why it happens?

Upvotes: 0

Views: 389

Answers (1)

gojomo
gojomo

Reputation: 54233

TL;DR: Your non-default epochs=3 can result in some nodes appearing only 3 times – but the inner Word2Vec model by default ignores tokens appearing fewer than 5 times. Upping to epochs=5 may be a quick fix - but read on for the reasons & tradeoffs with various defaults.

--

If you're using the nodevectors package described here, it seems to be built on Gensim's Word2Vec – which uses a default min_count=5.

That means any tokens – in this case, nodes – which appear fewer than 5 times are ignored. Especially in the natural-language contexts where Word2Vec was pioneered, discarding such rare words entirely usually has multiple benefits:

  • from only a few idiosyncratic examples, such rare words themselves get peculiar vectors less-likely to generalize to downstream uses (other texts)
  • compared to other frequent words, each gets very little training effort overall, & thus provides only a little pushback on shared model weights (based on their peculiar examples) - so the vectors are weaker & retain more arbitrary influence from random-initialization & relative positioning in the corpus. (More-frequent words provide more varied, numerous examples to extract their unique meaning.)
  • because of the Zipfian distribution of word-frequencies in natural language, there are a lot of such low-frequency words – often even typos – and altogether they take up a lot of the model's memory & training-time. But they don't individually get very good vectors, or have generalizable beneficial influences on the shared model. So they wind up serving a lot like noise that weakens other vectors for more-frequent words, as well.

So typically in Word2Vec, discarding rare words only gives up low-value vectors while simultaneously speeding training, shrinking memory requirements, & improving the quality of the remaining vectors: a big win.

Although the distribution of node-names in graph random-walks may be very different from natural-language word-frequencies, some of the same concerns still apply for nodes that appear rarely. On the other hand, if a node truly only appears at the end of a long chain of nodes, every walk to or from it will include the exact same neighbors - and maybe extra appearances in more walks would add no new variety-of-information (at least within the inner Word2Vec window of analysis).

You may be able to confirm if the default min_count is your issue by using the Node2Vec keep_walks parameter to store the generated walks, then checking: are exactly the nodes that are 'missing' appearing fewer than min_count times in the walks?

If so, a few options may be:

  • override min_count using the Node2Vec w2vparams option to something like min_count=1. As noted above, this is always a bad idea in traditional natural-language Word2Vec - but maybe it's not so bad in a graph application, where for rare/outer-edge nodes one walk is enough, and then at least you have whatever strange/noisy vector results from that minimal training.
  • try to influence the walks to ensure all nodes appear enough times. I suppose some values of the Node2Vec walklen, return_weight, & neighbor_weight could improve coverage - but I don't think they could guarantee all nodes appear in at least N (say, 5, to match the default min_count) different walks. But it looks like the Node2Vec epochs parameter controls how many time every node is used as a starting point – so epochs=5 would guarantee every node appears at least 5 times, as the start of 5 separate walks. (Notably: the Node2Vec default is epochs=20 - which would never trigger a bad interaction with the internal Word2Vec min_count=5. But setting your non-default epochs=3 risks leaving some nodes with only 3 appearances.)

Upvotes: 1

Related Questions