jmuth

Reputation: 71

GPU showing no speed up over CPU

I'm training a neural network with two hidden layers of 100 nodes each, four inputs and one output, and a batch size of 32, and I'm seeing no speed improvement from using the GPU instead of the CPU. I only have a limited data set (1067 samples, all copied to the GPU at the start), but I would have thought the ~33 batches could run in parallel, more than making up for the time spent copying to the GPU. Is my data set too small, or is there some other issue? Here is my code snippet:

import torch

def train_for_regression(X, T):
    BATCH_SIZE = 32
    n_epochs = 1000
    learning_rate = 0.01
    device = torch.device("cuda:0")
    Xt = torch.from_numpy(X).float().to(device)  # training inputs, shape (1067, 4)
    Tt = torch.from_numpy(T).float().to(device)  # training targets, shape (1067, 1)
    
    nnet = torch.nn.Sequential(torch.nn.Linear(4, 100), 
                               torch.nn.Tanh(), 
                               torch.nn.Linear(100, 100), 
                               torch.nn.Tanh(),
                               torch.nn.Linear(100, 1))
    nnet.to(device)
    mse_f = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(nnet.parameters(), lr=learning_rate)

    for epoch in range(n_epochs):
        for i in range(0, len(Xt), BATCH_SIZE):
            batch_Xt = Xt[i:i+BATCH_SIZE,:]
            batch_Tt = Tt[i:i+BATCH_SIZE,:]
            optimizer.zero_grad()
            Y = nnet(batch_Xt)
            mse = mse_f(Y, batch_Tt)
            mse.backward()
            optimizer.step()
    return nnet

Upvotes: 0

Views: 977

Answers (1)

stan0

Reputation: 11807

Chances are the time required to get the data to the GPU negates the GPU's benefit. On top of that, the network is so small that the CPU handles it efficiently, so any speedup from the GPU would not be large anyway.
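One way to see where the time actually goes is to time the two devices in isolation. Below is a minimal timing sketch (the time_training helper and the iteration counts are made up for illustration, not taken from your code) that runs the same forward/backward/step loop for a 4-100-100-1 network on CPU and GPU, calling torch.cuda.synchronize() so queued GPU kernels are included in the measurement:

import time
import torch

def time_training(device, batch_size=32, n_iters=1000):
    # Hypothetical helper: times n_iters forward/backward/step iterations
    # of a 4-100-100-1 network on the given device with dummy data.
    net = torch.nn.Sequential(torch.nn.Linear(4, 100),
                              torch.nn.Tanh(),
                              torch.nn.Linear(100, 100),
                              torch.nn.Tanh(),
                              torch.nn.Linear(100, 1)).to(device)
    x = torch.randn(batch_size, 4, device=device)   # dummy batch of inputs
    t = torch.randn(batch_size, 1, device=device)   # dummy batch of targets
    mse_f = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
    if device.type == "cuda":
        torch.cuda.synchronize()                    # finish setup before timing
    start = time.perf_counter()
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = mse_f(net(x), t)
        loss.backward()
        optimizer.step()
    if device.type == "cuda":
        torch.cuda.synchronize()                    # wait for queued GPU kernels
    return time.perf_counter() - start

print("cpu :", time_training(torch.device("cpu")))
print("cuda:", time_training(torch.device("cuda:0")))

With layers this small you would typically expect the two numbers to be close, which is the point above: the per-step work is too small for the GPU to pay off.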

Also, what the GPU parallelizes is the matrix computation within a single step, in this case multiplying one batch's data by the network's weights. The batches themselves are processed sequentially, not in parallel, unless you take extra steps such as using additional libraries and/or multiple GPUs.
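If the goal is simply to give the GPU more work per step, one option (a sketch reusing the names from your snippet, and a change to the training behaviour rather than a drop-in fix) is to use a larger batch, for example the whole 1067-sample set at once, so each step is one big matrix multiply instead of ~33 small sequential ones:

# Full-batch variant of the training loop: each epoch does one
# forward/backward pass over all of Xt/Tt instead of ~33 tiny batches.
for epoch in range(n_epochs):
    optimizer.zero_grad()
    Y = nnet(Xt)            # all 1067 samples in a single matrix multiply chain
    mse = mse_f(Y, Tt)
    mse.backward()
    optimizer.step()

Whether that helps final accuracy is a separate question, since larger batches change the optimization dynamics, but it is the kind of change that actually increases GPU utilization, unlike looping over small batches.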

Upvotes: 1
