PyG graph autoencoder loss is frozen, possible Data object assembly issue

Question

I'm trying to train a Graph Autoencoder (GAE) on a custom PyG Data object, but the loss, AUC, and AP do not change during training. The exact same autoencoder works on PyTorch Geometric's example data objects, so I suspect the error is somewhere in how I build my custom Data object. I'm using Geoff Boeing's street network node/edge lists, specifically Aberdeen in this example. Below is how I build the Data object, with node features (xy coordinates) and an edge index (source and destination nodes), from the node and edge CSVs (already loaded into the DataFrames nodes_ab and edges_ab).

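import torch
import torch_geometric.transforms as T
from torch_geometric.data import Data
from torch_geometric.nn import GAE, GCNConv

# Creating node feature tensors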
node_features = nodes_ab[['x', 'y']].values
node_features = torch.tensor(node_features, dtype=torch.float)

# Creating edge index
edge_index = edges_ab[['source', 'dest']].values.T
edge_index = torch.tensor(edge_index, dtype=torch.long)

# Create data object
data = Data(x=node_features, edge_index=edge_index)

# Split data
transform = T.RandomLinkSplit(num_val=0.05,
                              num_test=0.1,
                              is_undirected=True,
                              add_negative_train_samples=True)
train_data, val_data, test_data = transform(data)

# Extract positive and negative edges for train, validation, and test sets
def get_pos_neg_edges(data):
    pos_edge_index = data.edge_label_index[:, data.edge_label == 1]
    neg_edge_index = data.edge_label_index[:, data.edge_label == 0]
    return pos_edge_index, neg_edge_index

train_pos_edge_index, train_neg_edge_index = get_pos_neg_edges(train_data)
val_pos_edge_index, val_neg_edge_index = get_pos_neg_edges(val_data)
test_pos_edge_index, test_neg_edge_index = get_pos_neg_edges(test_data)

# Add these to the data object
data.train_pos_edge_index = train_pos_edge_index
data.train_neg_edge_index = train_neg_edge_index
data.val_pos_edge_index = val_pos_edge_index
data.val_neg_edge_index = val_neg_edge_index
data.test_pos_edge_index = test_pos_edge_index
data.test_neg_edge_index = test_neg_edge_index


# Create encoder and autoencoder
class GCNEncoder(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super(GCNEncoder, self).__init__()
        self.conv1 = GCNConv(in_channels, 2 * out_channels, cached=True) # cached only for transductive learning
        self.conv2 = GCNConv(2 * out_channels, out_channels, cached=True) # cached only for transductive learning

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

# parameters
out_channels = 2
num_features = data.num_features
epochs = 100

# model
model = GAE(GCNEncoder(num_features, out_channels))

# move to GPU (if available)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
x = data.x.to(device)
train_pos_edge_index = data.train_pos_edge_index.to(device)

# initialize the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.03)

def train():
    model.train()
    optimizer.zero_grad()
    z = model.encode(x, train_pos_edge_index)
    loss = model.recon_loss(z, train_pos_edge_index)
    loss.backward()
    optimizer.step()
    print(f"Training loss: {loss.item()}")
    return float(loss)

def test(pos_edge_index, neg_edge_index):
    model.eval()
    with torch.no_grad():
        z = model.encode(x, train_pos_edge_index)
    auc, ap = model.test(z, pos_edge_index, neg_edge_index)
    return auc, ap

# Train the model
for epoch in range(1, epochs + 1):
    loss = train()

    auc, ap = test(data.test_pos_edge_index, data.test_neg_edge_index)
    print('Epoch: {:03d}, AUC: {:.4f}, AP: {:.4f}\n _________________________'.format(epoch, auc, ap))

Sorry for the massive code wall! I don't know where the issue is occurring so wanted to include everything. Here is the output from running this.

Training loss: 34.538780212402344 Epoch: 001, AUC: 0.5000, AP: 0.5000

Training loss: 34.538780212402344 Epoch: 002, AUC: 0.5000, AP: 0.5000

Training loss: 34.538780212402344 Epoch: 003, AUC: 0.5000, AP: 0.5000

Etc etc, for 100 epochs.
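
A note on the number itself: 34.5388 is almost exactly -ln(1e-15). If the reconstruction loss adds a small epsilon of 1e-15 inside the logarithm (an assumption about the library internals, not something shown in this post), this is the value you would see when the decoder's sigmoid saturates at 1.0 for every edge, positive and negative alike. In that regime the gradients vanish, which would also explain the constant AUC and AP of 0.5. The sketch below reproduces the arithmetic in plain Python; `EPS`, `recon_loss`, and the logit values are illustrative stand-ins, not the library's code.

```python
import math

EPS = 1e-15  # assumed epsilon inside the log; not taken from the post


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def recon_loss(pos_logits, neg_logits):
    """Binary cross-entropy over positive and negative edge logits."""
    pos = sum(-math.log(sigmoid(t) + EPS) for t in pos_logits) / len(pos_logits)
    neg = sum(-math.log(1.0 - sigmoid(t) + EPS) for t in neg_logits) / len(neg_logits)
    return pos + neg


# Huge inner products (made-up values): every edge is scored as 1.0.
saturated = recon_loss([1e6, 2e6], [1e6, 3e6])
# Moderate inner products (made-up values) keep the sigmoid away from 0 and 1.
healthy = recon_loss([0.5, 1.0], [-0.5, 0.2])

print(f"saturated: {saturated:.4f}")  # about 34.5388, i.e. -ln(1e-15)
print(f"healthy:   {healthy:.4f}")
```

If this reading is right, the thing to look for is whatever makes the embeddings so large, such as unscaled input features.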

I've tried adding node labels (y), using dummy variables for the node features, using the node index as node labels, and relabeling the index numbers, but I still run into the same issue. I found two other people who experienced this here, but their issues were solved by adding (non-index) node features, which hasn't worked for me. Thanks in advance, and let me know if anything is unclear or more info is needed. This is my first time posting on Stack Overflow.

EDIT: How I fixed it + preprocessing steps

I forgot to mention some of my data preprocessing steps. To solve some other issues that were popping up, I set a new node index starting from 0 and remapped the edge DataFrame's source and destination nodes to match it. Here's how I did that.

# Creating new index for nodes
nodes_ab['new_index'] = range(len(nodes_ab))
node_index = nodes_ab[['new_index']].values
node_index = torch.tensor(node_index, dtype=torch.float)

# Matching 'u' and 'v' with new node index
edges_ab = edges_ab.merge(nodes_ab[['osmid', 'new_index']], how='left', left_on='u', right_on='osmid')
edges_ab = edges_ab.rename(columns={'new_index': 'new_source'}).drop(columns=['osmid'])
edges_ab = edges_ab.merge(nodes_ab[['osmid', 'new_index']], how='left', left_on='v', right_on='osmid')
edges_ab = edges_ab.rename(columns={'new_index': 'new_dest'}).drop(columns=['osmid'])
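
A left merge leaves NaN wherever an edge endpoint has no matching node, so it can be worth checking the remapping before building the edge index tensor. The same idea is sketched below with a plain dictionary, which makes the checks explicit. The node IDs and edges are made up for illustration; `node_ids` stands in for nodes_ab['osmid'] and `edges` for the (u, v) pairs.

```python
# Hypothetical miniature node and edge lists; the IDs are made up.
node_ids = [1001, 1005, 1009, 1020]
edges = [(1001, 1005), (1005, 1009), (1009, 1020)]

# Map each original ID to a contiguous index 0..N-1.
new_index = {osmid: i for i, osmid in enumerate(node_ids)}

# Endpoints missing from the node list would become NaN in a left merge;
# here they are collected explicitly instead.
unmatched = [(u, v) for u, v in edges if u not in new_index or v not in new_index]
assert not unmatched, f"edges reference unknown nodes: {unmatched}"

# Build the 2 x E edge index as two parallel rows (sources, destinations).
sources = [new_index[u] for u, _ in edges]
dests = [new_index[v] for _, v in edges]
edge_index = [sources, dests]

# Every index must fall inside [0, N).
assert all(0 <= i < len(node_ids) for row in edge_index for i in row)
print(edge_index)
```

With pandas, the equivalent checks would be that new_source and new_dest contain no missing values and that their maximum is below len(nodes_ab).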

While looking for a fix, I changed my transforms to also normalize the features and move the data to the device, copying this example.

# Create transforms
transform = T.Compose([
    T.NormalizeFeatures(),
    T.ToDevice(device),
    T.RandomLinkSplit(num_val=0.05, num_test=0.1, is_undirected=True,
                      split_labels=True, add_negative_train_samples=True),
])

# Create data object (other features were also added after I got it working)
data = Data(x=node_features, edge_index=edge_index, edge_attr=edge_features, y=node_index)

# Split data
train_data, val_data, test_data = transform(data)
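
Raw coordinates can be very large numbers, which is one plausible reason the loss stayed frozen in the original run. This post uses T.NormalizeFeatures(); an alternative that might suit coordinate features is to standardize each column to zero mean and unit variance before building the Data object. The sketch below shows that alternative only, with made-up coordinate values and a hypothetical `standardize` helper. It is not a description of what the PyG transform does internally.

```python
from statistics import mean, pstdev

# Hypothetical projected coordinates in metres; not taken from the dataset.
coords = [(392500.0, 806200.0), (392950.0, 805800.0), (393400.0, 806900.0)]


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


scaled = standardize(coords)
for row in scaled:
    print(tuple(round(v, 3) for v in row))
```

The scaled values could then be passed to torch.tensor(..., dtype=torch.float) in place of the raw x and y columns.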

I then changed the training process so that it reads directly from the separate train_data and test_data objects, instead of extracting the needed tensors from them and attaching those to the original data object.

def train(train_data):
    model.train()
    optimizer.zero_grad()
    z = model.encode(train_data.x, train_data.edge_index)
    loss = model.recon_loss(z, train_data.pos_edge_label_index)
    loss.backward()
    optimizer.step()
    print(f"Training loss: {loss.item()}")
    return float(loss)

@torch.no_grad()
def test(data):
    model.eval()
    z = model.encode(data.x, data.edge_index)
    return model.test(z, data.pos_edge_label_index, data.neg_edge_label_index)

for epoch in range(1, epochs + 1):
    loss = train(train_data)

    auc, ap = test(test_data)
    print('Epoch: {:03d}, AUC: {:.4f}, AP: {:.4f}\n _________________________'.format(epoch, auc, ap))

And now the model works! Though I will have to figure out how to get the AUC a little higher, haha.

Training loss: 1.4984779357910156 Epoch: 001, AUC: 0.5098, AP: 0.4645

Training loss: 1.4174818992614746 Epoch: 002, AUC: 0.5104, AP: 0.4647

Training loss: 1.394394874572754 Epoch: 003, AUC: 0.5108, AP: 0.4649
