Eric Broda
Eric Broda

Reputation: 7241

Creating embeddings using StellarGraph are not reproducible

I am using StellarGraph (a fantastic graph neural network package) and am trying to create embeddings for a particular graph/feature set. Unfortunately, the embeddings are different each time I create/train the graph despite providing identical information each time.

Is this bug, or am I using StellarGraph incorrectly?

Below is the code that demonstrates the issue:

import networkx as nx
import random
import numpy as np
import pandas as pd
import keras
import stellargraph as sg
from stellargraph.mapper import GraphSAGELinkGenerator, GraphSAGENodeGenerator
from stellargraph.layer import GraphSAGE, link_classification
from stellargraph.data import UnsupervisedSampler

# Establish random seed
RANDOM_SEED = 42
random.seed(RANDOM_SEED)

# Create a graph from well-known karate club data
print(f"Creating graph")
graph = nx.karate_club_graph()

# Create features for each node
print(f"Creating features")
features = []
nodes = list(graph.nodes)
columns = ["c-" + str(x) for x in range(10)]
nodes.sort()
for node in nodes:
    f = {c: random.random() for c in columns}
    features.append(f)

features_df = pd.DataFrame(features)
print(f"features_df: \n{features_df}")

for i in range(2):
    print(f"----- Iteration: {i} -----")

    # Create the model and generators
    print(f"Creating the model and generators")
    Gs = sg.StellarGraph(graph, node_features=features_df)
    unsupervisedSamples = UnsupervisedSampler(Gs, nodes=graph.nodes(), length=5, number_of_walks=3, seed=RANDOM_SEED)
    train_gen = GraphSAGELinkGenerator(Gs, 50, [5, 5], seed=RANDOM_SEED).flow(unsupervisedSamples)
    graphsage = GraphSAGE(layer_sizes=[100, 100], generator=train_gen, bias=True, dropout=0.0, normalize="l2")
    x_inp_src, x_out_src = graphsage.node_model(flatten_output=False)
    x_inp_dst, x_out_dst = graphsage.node_model(flatten_output=False)

    x_inp = [x for ab in zip(x_inp_src, x_inp_dst) for x in ab]
    x_out = [x_out_src, x_out_dst]
    edge_embedding_method = "l2"
    prediction = link_classification(output_dim=1, output_act="sigmoid", edge_embedding_method=edge_embedding_method)(x_out)

    # Create and train the Keras model
    model = keras.Model(inputs=x_inp, outputs=prediction)
    learning_rate = 1e-2
    model.compile(
        optimizer=keras.optimizers.Adam(lr=learning_rate),
        loss=keras.losses.binary_crossentropy,
        metrics=[keras.metrics.binary_accuracy])

    _ = model.fit_generator(train_gen, epochs=5, verbose=2, use_multiprocessing=False, workers=1, shuffle=False)

    # Create the embeddings
    print(f"Creating the embeddings")
    nodes = list(graph.nodes)
    nodes.sort()
    print(f"Nodes: {nodes}")

    # Create a generator that serves up nodes for use in embedding prediction / creation
    node_gen = GraphSAGENodeGenerator(Gs, 50, [5, 5], seed=RANDOM_SEED).flow(nodes)

    embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)
    embeddings = embedding_model.predict_generator(node_gen, workers=4, verbose=1)
    embeddings = embeddings[:, 0, :]

    np.set_printoptions(threshold=10)
    print(f"embeddings: {embeddings.shape} \n{embeddings}")

There are a number of debug (print output) statements when the code is executed. (sample output is shown below). Note that the embeddings are different despite the identical inputs, graph configuration, model configuration, and random see values.

----- Iteration: 0 -----
:
:
1/1 [==============================] - 0s 58ms/step
embeddings: (34, 100) 
[[-0.10566715  0.02253576 -0.18743701 ... -0.1028127   0.03689012
  -0.02482301]
 [-0.03171733  0.01606975 -0.08616363 ... -0.11775644  0.0429472
  -0.02371055]
 [-0.05802531  0.03910012 -0.10229243 ... -0.15050544  0.06637941
  -0.01950052]
 ...
 [ 0.03011296  0.08852117 -0.01836969 ... -0.154132    0.03844732
  -0.08643046]
 [ 0.01052345 -0.0123206   0.08913474 ... -0.11741614  0.03202919
  -0.04432516]
 [ 0.01951274  0.06263477  0.07959272 ... -0.10350229  0.05735112
  -0.0368157 ]]
:
:
----- Iteration: 1 -----
embeddings: (34, 100) 
[[ 0.11182436 -0.02642134  0.01168384 ...  0.10322241 -0.01680471
  -0.03918815]
 [ 0.02391489  0.02674667 -0.00091334 ...  0.12946768 -0.02389602
  -0.01414653]
 [ 0.08718258 -0.01711811 -0.05704292 ...  0.13477756 -0.00658288
  -0.05889895]
 ...
 [ 0.06843725 -0.13134597 -0.10870655 ...  0.11091235 -0.05146989
  -0.06138216]
 [-0.00593233 -0.05901312 -0.02113489 ... -0.01590953 -0.02516254
  -0.02280537]
 [ 0.00871993 -0.04059998 -0.07237951 ... -0.01590569 -0.00954109
  -0.01116194]]

Upvotes: 2

Views: 1099

Answers (1)

kevin
kevin

Reputation: 71

This was previously a bug in stellargraph, which has now been resolved in v0.9.0 https://github.com/stellargraph/stellargraph/releases/tag/v0.9.0

Unsupervised GraphSAGE has now been updated and tested for reproducibility. Ensuring all seeds are set, running the same pipeline should give reproducible embeddings.

Currently "ensuring all seeds are set" for unsupervised GraphSAGE means:

  • fixing the seed for these external packages: numpy, tensorflow, and random
  • providing a seed when constructing UnsupervisedSampler and GraphSAGELinkGenerator objects. These classes are used to perform random walks and neighbourhood sampling.

Upvotes: 3

Related Questions