abmin

Reputation: 143

Using PyTorch nn.Sequential() to define a network in a flexible way, but getting unexpected results

I tried to define a network in a more flexible way using nn.Sequential, so that I can choose its number of layers according to layernum:

seed = 0
torch.manual_seed(seed)
# ====== net_a =====
layers = [ nn.Linear(7, 64), nn.Tanh()]
for i in range(layernum-1): # layernum = 3
    layers.append(nn.Linear(64, 64))
    layers.append(nn.Tanh())
layers.append(nn.Linear(64, 8))
net_x = nn.Sequential(*layers)
net_y = nn.Sequential(*layers)
net_z = nn.Sequential(*layers)

# ====== net_b =====
net_x = nn.Sequential(
    nn.Linear(7, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, 8),
)
net_y = nn.Sequential(
    #... same as net_x
)
net_z = nn.Sequential(
    #... same as net_x
)

# print(net_x)
# print(net_x[0].weight)

I use them one at a time: both definitions are in the same .py file, and I comment out whichever one I am not using. Each version consists of 3 networks, one for each of the 3 dimensions (x, y, and z).

I expected them to be the same network with same training performance.

The structure seems to be the same according to print(net_x):

    # Sequential(
    #   (0): Linear(in_features=7, out_features=64, bias=True)
    #   (1): Tanh()
    #   (2): Linear(in_features=64, out_features=64, bias=True)
    #   (3): Tanh()
    #   (4): Linear(in_features=64, out_features=64, bias=True)
    #   (5): Tanh()
    #   (6): Linear(in_features=64, out_features=8, bias=True)
    # )

But their initial weights are different according to print(net_x[0].weight):

print(net_x[0].weight) # net_a
    # tensor([[-0.0028,  0.2028, -0.3111, -0.2782, -0.1456,  0.1014, -0.0075],
    #         [ 0.2997, -0.0335,  0.1000, -0.1142, -0.0743, -0.3611, -0.2503],
    #         ......

print(net_x[0].weight) # net_b
    # tensor([[ 0.2813,  0.2968,  0.0078,  0.1518,  0.3776, -0.3247,  0.0071],
    #         [ 0.3448, -0.0988, -0.2798,  0.3347,  0.3581,  0.2229,  0.2841],
    #         ......

======ADDED=====

I trained the network like this:

def train_on_batch(x, y, net, stepsize=innerstepsize):
    x = totorch(x)
    y = totorch(y)
    if use_cuda:
        x, y = x.cuda(), y.cuda()
    net.zero_grad()
    ypred = net(x)
    loss = (ypred - y).pow(2).mean()
    loss.backward()
    for param in net.parameters():
        param.data -= stepsize * param.grad.data

iteration = 100
for iter in range(iteration):

    # TRAIN
    PrepareSample() # get in_support
    for i in range(tnum_support):
        out_x = trajectory_support_x[i,1:9]
        out_y = trajectory_support_y[i,1:9]
        out_z = trajectory_support_z[i,1:9]
        # Do SGD on this task
        for _ in range(innerepochs): # SGD 1 times
            train_on_batch(in_support[i], out_x, net_x)
            train_on_batch(in_support[i], out_y, net_y)
            train_on_batch(in_support[i], out_z, net_z)

    # TEST
    if iter==0 or (iter+1) % 10 == 0:
        ind = [0,1,2,3,4,5,6,7,8,9]
        loss = [0,0,0,0,0,0]
        for i in range(tnum_test):
            inputs = in_test[i]
            outputs_x = trajectory_test_x[i].tolist()
            x_test = trajectory_test_x[i,[0,9]]
            y_test = trajectory_test_x[i,1:9]
            pred_x = np.hstack((x_test[0],predict(inputs, net_x),x_test[1]))
            loss[i] = np.square(predict(inputs, net_x) - y_test).mean() # mse

            inputs = in_test[i]
            outputs_y = trajectory_test_y[i].tolist()
            x_test = trajectory_test_y[i,[0,9]]
            y_test = trajectory_test_y[i,1:9]
            pred_y = np.hstack((x_test[0],predict(inputs, net_y),x_test[1]))
            loss[i+2] = np.square(predict(inputs, net_y) - y_test).mean() # mse

            inputs = in_test[i]
            outputs_z = trajectory_test_z[i].tolist()
            x_test = trajectory_test_z[i,[0,9]]
            y_test = trajectory_test_z[i,1:9]
            pred_z = np.hstack((x_test[0],predict(inputs, net_z),x_test[1]))
            loss[i+4] = np.square(predict(inputs, net_z) - y_test).mean() # mse

        iterNum.append(iter+1)
        avgloss.append(np.mean(loss))

Both of them are trained with exactly the same data (they are in the same .py file, so of course they use the same data).

=====This is avgloss of net_a: [plot of avgloss over iterations]

=====This is avgloss of net_a with torch.manual_seed(seed) before every network definition: [plot of avgloss over iterations]

=====This is avgloss of net_b: [plot of avgloss over iterations]

=====This is avgloss of net_b with torch.manual_seed(seed) before every network definition: [plot of avgloss over iterations]

The training of net_a is odd: the MSE is already high at the start and does not decrease. In contrast, the training of net_b looks normal: the MSE is relatively low at first and drops to a smaller value after 100 iterations.

Does anyone know how to fix this? I would like to try different numbers of layers, layer sizes, and activation functions, and I don't want to write out a separate network definition for every combination of hyper-parameters.

Upvotes: 2

Views: 7310

Answers (1)

hkchengrex

Reputation: 4826

  1. The random state is different after torch has initialized the weights in the first network. To keep the same initialization, you need to reset the random state by calling torch.manual_seed(seed) after the definition of the first network and before the second one.

  2. The problem lies in net_x/y/z -- it would be perfectly fine if there were just net_x. nn.Sequential does not create new modules; it stores references to the modules you pass in. So in your first definition there is only one copy of the layers, meaning that net_x/y/z all share the same weights (a quick check is sketched below). In your second definition they have independent weights, which is naturally what we are after.
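A quick way to see this is to check whether the sub-modules are the same Python objects. This is just a minimal sketch, assuming layernum = 3 as in the question; with the list-based definition, all three nets hold references to the very same layers:

import torch
import torch.nn as nn

layernum = 3
layers = [nn.Linear(7, 64), nn.Tanh()]
for i in range(layernum - 1):
    layers.append(nn.Linear(64, 64))
    layers.append(nn.Tanh())
layers.append(nn.Linear(64, 8))

net_x = nn.Sequential(*layers)
net_y = nn.Sequential(*layers)

# Both Sequentials wrap the very same Linear objects, so their weights are shared:
print(net_x[0] is net_y[0])                 # True
print(net_x[0].weight is net_y[0].weight)   # True
# A gradient step taken through net_x therefore also changes net_y (and net_z).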

You might define it like this instead:

def get_net():
    layers = [ nn.Linear(7, 64), nn.Tanh()]
    for i in range(layernum-1): # layernum = 3
        layers.append(nn.Linear(64, 64))
        layers.append(nn.Tanh())
    layers.append(nn.Linear(64, 8))
    return layers

net_x = nn.Sequential(*get_net())
net_y = nn.Sequential(*get_net())
net_z = nn.Sequential(*get_net())

Each time get_net is called, it creates a new copy of the layers, so each network gets its own parameters.
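To also address point 1 (making all three networks start from identical weights), a minimal sketch is to reset the seed before each definition, as the question already does in its seeded runs, and then verify that the parameters are equal in value but no longer shared:

seed = 0

torch.manual_seed(seed)
net_x = nn.Sequential(*get_net())
torch.manual_seed(seed)
net_y = nn.Sequential(*get_net())
torch.manual_seed(seed)
net_z = nn.Sequential(*get_net())

print(torch.equal(net_x[0].weight, net_y[0].weight))  # True  (same initial values)
print(net_x[0].weight is net_y[0].weight)             # False (independent tensors)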

Upvotes: 3
