Reputation: 549
I have a Graph Neural Network model written in PyTorch. On my CPU the performance is not fantastic, so I tried to port it over to a V100 GPU I have access to. In the process, I got a huge performance decrease (around 10 times slower).
I have two ideas about where the issue might be, but I would like some input so I can get the best performance out of my model. The first problem might be my custom graph convolutional layer:
import torch
import torch.nn as nn


class GraphConvLayer(torch.nn.Module):
    """
    Based, basically, on https://arxiv.org/abs/1609.02907
    Have some modifications:
    https://towardsdatascience.com/how-to-do-deep-learning-on-graphs-with-graph-convolutional-networks-7d2250723780
    This helped:
    https://pytorch.org/docs/master/notes/extending.html
    """
    def __init__(self, input_features, output_features, device, bias=True):
        super(GraphConvLayer, self).__init__()
        self.input_features = input_features
        self.output_features = output_features
        self.device = device
        self.weight = nn.Parameter(torch.FloatTensor(self.input_features, self.output_features))
        if bias:
            self.bias = nn.Parameter(torch.FloatTensor(self.output_features))
        else:
            self.register_parameter('bias', None)

        # Not a very smart way to initialize weights
        self.weight.data.uniform_(-0.1, 0.1)
        if self.bias is not None:
            self.bias.data.uniform_(-0.1, 0.1)

    def forward(self, input, adj):
        # Here, we put in the forward pass:
        # Our forward pass needs to be:
        # D^-1 * (A + I) * X * weights
        input, adj = input.float(), adj.float()

        # Add self-loops: A_hat = A + I
        Identity = torch.eye(len(adj[0]), device=self.device)
        A_hat = adj + Identity

        # Build the (dense) degree matrix D and invert it
        D = torch.sum(A_hat, dim=0)
        len_D = len(D)
        zero = torch.zeros(len_D, len_D, device=self.device)
        mask = torch.diag(torch.ones_like(D, device=self.device))
        D = mask * torch.diag(D) + (1. - mask) * zero
        D_inv = torch.inverse(D)

        # D^-1 * (A + I) * X * W
        out = torch.mm(input, self.weight)
        out = torch.spmm(A_hat, out)
        out = torch.spmm(D_inv, out)
        if self.bias is not None:
            return out + self.bias
        return out

    def extra_repr(self):
        # (Optional) Set the extra information about this module. You can test
        # it by printing an object of this class.
        return 'input_features={}, output_features={}, bias={}'.format(
            self.input_features, self.output_features, self.bias is not None
        )
Specifically, in the forward pass I am doing the sequence of transformations described in the towardsdatascience link quoted in the class docstring. Is there something here that is causing this large slow-down? As far as I can tell, the tensors are all being initialised on the GPU.
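(For what it is worth, a minimal way I could time this properly on the GPU, since CUDA kernels run asynchronously, would be something like the sketch below, using one sample that is already on the device; without the explicit synchronisation the Python timer only measures the kernel launch, not the execution.)

import time
import torch

# Rough timing sketch: warm up, synchronise, then average over many passes.
for _ in range(5):
    model(features[0], adjacency[0])
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    model(features[0], adjacency[0])
torch.cuda.synchronize()
print((time.perf_counter() - start) / 100, "seconds per forward pass")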
Secondly, as all my graphs are of different sizes, I am forced to use a batch size of 1. In my training loop I have this:
for batch in tqdm(train_loader):
    opt.zero_grad()
    adjacency, features, _, nodes = batch
    adjacency = adjacency.to(device)
    features = features.to(device)
    nodes = nodes.to(device)
    output = model(features[0], adjacency[0])
    loss = F.nll_loss(output, nodes[0])
    loss.backward()
    opt.step()
This means (as I interpret it) that every single piece of data is moved to the GPU individually, on every iteration, which seems like an obvious source of inefficiency. Is there a way to move all the data into GPU memory at once, outside the training loop, allowing me to remove the adjacency = adjacency.to(device) lines?
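Something like the following is what I have in mind (a rough sketch only, assuming the whole dataset fits in GPU memory; train_data here stands for the underlying dataset rather than the DataLoader):

# Move every graph to the GPU once, before training starts.
gpu_data = [
    (adj.to(device), feat.to(device), nodes.to(device))
    for adj, feat, _, nodes in train_data
]

for adjacency, features, nodes in tqdm(gpu_data):
    opt.zero_grad()
    output = model(features, adjacency)
    loss = F.nll_loss(output, nodes)
    loss.backward()
    opt.step()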
Any help would be really appreciated.
Upvotes: 1
Views: 1475
Reputation: 11420
Your problem is almost certainly bound by memory movement to the GPU, especially since you mention your single-graph batches.
One thing that may help speed up the current implementation is looking into memory maps; from the provided code it is impossible to tell whether you are already using them.
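A closely related, concrete option (my sketch, not something visible in your code) is to keep the host-side data in pinned, page-locked memory and copy it asynchronously; with a standard DataLoader that looks roughly like this, where train_dataset is a stand-in name for your dataset object:

from torch.utils.data import DataLoader

# pin_memory=True keeps the host tensors in page-locked memory, which makes
# host-to-device copies faster and lets non_blocking=True overlap the copy
# with computation.
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True,
                          pin_memory=True, num_workers=2)

for adjacency, features, _, nodes in train_loader:
    adjacency = adjacency.to(device, non_blocking=True)
    features = features.to(device, non_blocking=True)
    nodes = nodes.to(device, non_blocking=True)
    # ... rest of the training step as before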
Other than that, even with differently sized adjacency matrices, padding might be a valid strategy if you manage to group your batches by roughly equal sizes, along the lines of the sketch below.
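Very roughly (pad_graph and max_nodes are illustrative names, not from your code): zero-pad each adjacency and feature matrix up to the largest graph in a bucket of similar-sized graphs, then stack them into one batch tensor that can go through the model in a single pass.

import torch

def pad_graph(adj, feats, max_nodes):
    # Zero-pad one graph so every sample in the bucket has max_nodes nodes.
    n = adj.shape[0]
    adj_padded = torch.zeros(max_nodes, max_nodes, dtype=adj.dtype)
    adj_padded[:n, :n] = adj
    feats_padded = torch.zeros(max_nodes, feats.shape[1], dtype=feats.dtype)
    feats_padded[:n] = feats
    return adj_padded, feats_padded

# After bucketing graphs of similar size and padding to the bucket maximum,
# torch.stack gives (batch, max_nodes, max_nodes) and (batch, max_nodes, n_feat)
# tensors that can be moved to the GPU and processed as one real batch.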
Your forward() function is also clearly not optimized and could deliver some speedup, but I would expect better batching to bring a much greater improvement.
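To illustrate the kind of rewrite I mean (a sketch only, assuming every node has a non-zero degree): D is diagonal, so building it as a dense matrix and calling torch.inverse on every step is wasted work; dividing the rows by the degree vector gives the same result and avoids the dense eye/diag/inverse chain.

def forward(self, input, adj):
    # D^-1 * (A + I) * X * W, without materialising or inverting a dense D
    input, adj = input.float(), adj.float()
    A_hat = adj + torch.eye(adj.shape[0], device=adj.device)
    deg = A_hat.sum(dim=0)              # degree vector, shape (N,)
    out = torch.mm(input, self.weight)  # X * W
    out = torch.mm(A_hat, out)          # (A + I) * X * W
    out = out / deg.unsqueeze(1)        # row-wise divide == multiplying by D^-1
    if self.bias is not None:
        out = out + self.bias
    return out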
Upvotes: 1