K C

Reputation: 433

How to apply node2vec for building link prediction model

I have been trying to learn python programming and I am trying to implement a project of link prediction.

I have a list containing tuple pairs like:

ten_author_pairs = [('creutzig', 'gao'), ('creutzig', 'linshaw'), ('gao', 'linshaw'), ('jing', 'zhang'), ('jing', 'liu'), ('zhang', 'liu'), ('jing', 'xu'), ('briant', 'einav'), ('chen', 'gao'), ('chen', 'jing'), ('chen', 'tan')]

And I am able to generate unconnected pairs - i.e. the pairs not present in the original list using the following code:

# generating negative examples

from itertools import combinations

elements = list(set([e for l in ten_author_pairs for e in l]))  # find all unique elements

complete_list = list(combinations(elements, 2))  # generate all possible pairs

# convert to sets so the order within a pair does not matter

set1 = [set(l) for l in ten_author_pairs]
complete_set = [set(l) for l in complete_list]

# keep the pairs in `complete_set` that are not in `set1`
ten_unconnected = [list(l) for l in complete_set if l not in set1]

print(len(ten_author_pairs))
print(len(ten_unconnected))

This leads me to having a very unbalanced dataset - which might be the expected case for a real-life dataset.

Next, in order to apply node2vec, first, I convert both these lists into dataframes -

import pandas as pd

df = pd.DataFrame(ten_author_pairs, columns=['u1', 'u2'])
df_negative = pd.DataFrame(ten_unconnected, columns=['u1', 'u2'])
df['link'] = 1  # for connected pairs
df_negative['link'] = 0  # for unconnected pairs

df_new = pd.concat([df, df_negative])

Then, I make the graph and apply node2vec as follows:

import networkx as nx

# build graph
G_data = nx.from_pandas_edgelist(df_new, "u1", "u2", create_using=nx.Graph())

#!pip install node2vec
from node2vec import Node2Vec

# Generate walks
node2vec = Node2Vec(G_data, dimensions=100, walk_length=16, num_walks=50)

# train node2vec model
n2w_model = node2vec.fit(window=7, min_count=1)

Finally I go for link prediction using Logistic regression as follows:

x = [(n2w_model.wv[str(i)] + n2w_model.wv[str(j)]) for i, j in zip(df_new['u1'], df_new['u2'])]

import numpy as np
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(np.array(x), df_new['link'],
                                                test_size=0.3,
                                                random_state=35)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(class_weight="balanced")

lr.fit(xtrain, ytrain)
predictions = lr.predict_proba(xtest)
from sklearn.metrics import roc_auc_score
roc_auc_score(ytest, predictions[:,1])

The score I obtain is 0.36 which is very poor.

Can anyone please help me:

  1. in pointing out where I am missing out in concept or code?
  2. in improving my scores?

I really thank you in advance for your help.

Upvotes: 1

Views: 1745

Answers (1)

leo_bouts

Reputation: 337

Your objective is to predict whether there would be a link between 2 unconnected nodes.

At first, you extract the pairs of nodes that don't have a link between them.

The next step is to hide some edges of the given graph. This is needed to prepare a training dataset: as a social network grows, new edges are introduced, and the machine learning model needs to learn how the graph evolved. The graph with the hidden edges is the graph G at time t, and your current dataset is the graph G at time t+n.

While removing links or edges, you should avoid removing any edge that would leave nodes disconnected or split the network into separate components. The next step is to create features for all the unconnected node pairs, including the ones you hid.

The removed edges will be labeled as 1 (positive samples) and the unconnected node pairs as 0 (negative samples).
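The edge-hiding step described above can be sketched in plain Python (a minimal illustration, assuming an undirected graph stored as an adjacency dict; the helper names `is_connected` and `hidable_edges` are made up for this example):

```python
from collections import deque

def is_connected(adj, skip_edge=None):
    """BFS over the adjacency dict, optionally pretending one edge is removed."""
    nodes = list(adj)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if skip_edge and {u, v} == set(skip_edge):
                continue  # ignore the hidden edge
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(nodes)

def hidable_edges(edges):
    """Return the edges that can be hidden without disconnecting the graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return [e for e in edges if is_connected(adj, skip_edge=e)]

edges = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')]
print(hidable_edges(edges))  # ('c', 'd') is excluded: hiding it would isolate 'd'
```

Only the edges returned by `hidable_edges` would be candidates for the positive (hidden) training set.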

After the labelling, you use the node2vec algorithm to extract node features from the graph. To compute the features of an edge, you can add up the features of the two nodes of that pair. These edge features are then used to train a logistic regression model.

You can add more measures as features for your node pairs so the model has more signal to work with, for example the Adamic/Adar index or the number of common neighbours.
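As an illustration, both measures can be computed directly from an adjacency dict (a minimal sketch; networkx also provides `common_neighbors` and `adamic_adar_index` generators if you already have a graph object):

```python
import math

# toy undirected graph as an adjacency dict
adj = {
    'jing':  {'zhang', 'liu', 'xu', 'chen'},
    'zhang': {'jing', 'liu'},
    'liu':   {'jing', 'zhang'},
    'xu':    {'jing'},
    'chen':  {'jing'},
}

def common_neighbors(adj, u, v):
    return adj[u] & adj[v]

def adamic_adar(adj, u, v):
    # sum of 1/log(degree(w)) over the common neighbors w of u and v
    return sum(1.0 / math.log(len(adj[w])) for w in common_neighbors(adj, u, v))

print(common_neighbors(adj, 'zhang', 'liu'))  # {'jing'}
print(adamic_adar(adj, 'zhang', 'liu'))       # 1/log(4), about 0.72
```

Any common neighbor has degree at least 2 (it touches both endpoints), so the `log` in the denominator is never zero.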

Because you have split your data into train and test samples, you have to map the probabilities the model predicted back to the node pairs they belong to.

predictions = lr.predict_proba(xtest)

for i in range(len(df)):
    try:
        # locate the row of xtrain that matches the feature vector of pair i
        index_in_x_train = np.where(xtrain == x[i])[0][1]
        predict_proba = lr.predict_proba(xtrain[index_in_x_train].reshape(1, -1))[:, 1]
        print(
            f'Probability of nodes {df.iloc[i, 0]} and {df.iloc[i, 1]} to form a link is: {float(predict_proba) * 100 : .2f}%')
    except IndexError:
        # the pair landed in the test split, so it has no matching row in xtrain
        continue

The try/except is there to avoid an out-of-index error: pairs that landed in the test split have no matching row in xtrain, so the lookup returns an empty array.

The low score may be the result of the way you hide edges. Try tuning the logistic regression with different parameters and test sizes.
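Tuning could look like the following (a sketch on synthetic stand-in data, assuming scikit-learn is installed; the grid values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-ins for the edge features and labels built above
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 100))          # 200 pairs, 100-dim edge features
y = (x[:, 0] - x[:, 1] > 0).astype(int)  # toy labels with some signal

param_grid = {
    'C': [0.01, 0.1, 1.0, 10.0],         # inverse regularization strength
    'class_weight': [None, 'balanced'],  # handle the class imbalance
}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring='roc_auc', cv=5)
search.fit(x, y)
print(search.best_params_, search.best_score_)
```

Scoring the search with `roc_auc` keeps the tuning objective aligned with the metric you report at the end.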

Also, you will need a bigger dataset so the model can be trained properly.

You could also try different machine learning models such as a random forest classifier or a multilayer perceptron.
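Swapping in a random forest, for instance, is nearly a one-line change (a sketch on synthetic stand-in data, assuming scikit-learn is installed; the variable names mirror the ones used above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-ins for the edge features and labels built above
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 100))          # 200 pairs, 100-dim edge features
y = (x[:, 0] + x[:, 1] > 0).astype(int)  # toy labels with some signal

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=35)

# drop-in replacement for LogisticRegression in the pipeline above
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
rf.fit(xtrain, ytrain)

predictions = rf.predict_proba(xtest)
print(roc_auc_score(ytest, predictions[:, 1]))
```

Because the rest of the pipeline only calls `fit` and `predict_proba`, any scikit-learn classifier (e.g. `MLPClassifier`) can be substituted the same way.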

Upvotes: 3
