Reputation: 143
I tried to define a network in a more flexible way using nn.Sequential, so that I can set its number of layers according to layernum:
seed = 0
torch.manual_seed(seed)

# ====== net_a ======
layers = [nn.Linear(7, 64), nn.Tanh()]
for i in range(layernum - 1):  # layernum = 3
    layers.append(nn.Linear(64, 64))
    layers.append(nn.Tanh())
layers.append(nn.Linear(64, 8))
net_x = nn.Sequential(*layers)
net_y = nn.Sequential(*layers)
net_z = nn.Sequential(*layers)

# ====== net_b ======
net_x = nn.Sequential(
    nn.Linear(7, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, 8),
)
net_y = nn.Sequential(
    # ... same as net_x
)
net_z = nn.Sequential(
    # ... same as net_x
)

# print(net_x)
# print(net_x[0].weight)
I use only one of them at a time: both definitions live in the same .py file, and I comment out whichever one I am not using. Each variant builds 3 networks, one per dimension (x, y, and z).
I expected them to be the same network with the same training performance.
The structure does look the same according to print(net_x):
# Sequential(
# (0): Linear(in_features=7, out_features=64, bias=True)
# (1): Tanh()
# (2): Linear(in_features=64, out_features=64, bias=True)
# (3): Tanh()
# (4): Linear(in_features=64, out_features=64, bias=True)
# (5): Tanh()
# (6): Linear(in_features=64, out_features=8, bias=True)
# )
But their initial weights are different according to print(net_x[0].weight):
print(net_x[0].weight) # net_a
# tensor([[-0.0028, 0.2028, -0.3111, -0.2782, -0.1456, 0.1014, -0.0075],
# [ 0.2997, -0.0335, 0.1000, -0.1142, -0.0743, -0.3611, -0.2503],
# ......
print(net_x[0].weight) # net_b
# tensor([[ 0.2813, 0.2968, 0.0078, 0.1518, 0.3776, -0.3247, 0.0071],
# [ 0.3448, -0.0988, -0.2798, 0.3347, 0.3581, 0.2229, 0.2841],
# ......
======ADDED=====
I trained the network like this:
def train_on_batch(x, y, net, stepsize=innerstepsize):
    x = totorch(x)
    y = totorch(y)
    if use_cuda:
        x, y = x.cuda(), y.cuda()
    net.zero_grad()
    ypred = net(x)
    loss = (ypred - y).pow(2).mean()
    loss.backward()
    for param in net.parameters():
        param.data -= stepsize * param.grad.data
iteration = 100
for iter in range(iteration):
    # TRAIN
    PrepareSample()  # get in_support
    for i in range(tnum_support):
        out_x = trajectory_support_x[i, 1:9]
        out_y = trajectory_support_y[i, 1:9]
        out_z = trajectory_support_z[i, 1:9]
        # Do SGD on this task
        for _ in range(innerepochs):  # SGD 1 times
            train_on_batch(in_support[i], out_x, net_x)
            train_on_batch(in_support[i], out_y, net_y)
            train_on_batch(in_support[i], out_z, net_z)
    # TEST
    if iter == 0 or (iter + 1) % 10 == 0:
        ind = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        loss = [0, 0, 0, 0, 0, 0]
        for i in range(tnum_test):
            inputs = in_test[i]
            outputs_x = trajectory_test_x[i].tolist()
            x_test = trajectory_test_x[i, [0, 9]]
            y_test = trajectory_test_x[i, 1:9]
            pred_x = np.hstack((x_test[0], predict(inputs, net_x), x_test[1]))
            loss[i] = np.square(predict(inputs, net_x) - y_test).mean()  # mse

            inputs = in_test[i]
            outputs_y = trajectory_test_y[i].tolist()
            x_test = trajectory_test_y[i, [0, 9]]
            y_test = trajectory_test_y[i, 1:9]
            pred_y = np.hstack((x_test[0], predict(inputs, net_y), x_test[1]))
            loss[i + 2] = np.square(predict(inputs, net_y) - y_test).mean()  # mse

            inputs = in_test[i]
            outputs_z = trajectory_test_z[i].tolist()
            x_test = trajectory_test_z[i, [0, 9]]
            y_test = trajectory_test_z[i, 1:9]
            pred_z = np.hstack((x_test[0], predict(inputs, net_z), x_test[1]))
            loss[i + 4] = np.square(predict(inputs, net_z) - y_test).mean()  # mse
        iterNum.append(iter + 1)
        avgloss.append(np.mean(loss))
Both of them are trained with exactly the same data (they are in the same .py file, so of course they use the same data).
===== This is the avgloss of net_a (plot): =====
===== This is the avgloss of net_a with torch.manual_seed(seed) before every network definition (plot): =====
===== This is the avgloss of net_b (plot): =====
===== This is the avgloss of net_b with torch.manual_seed(seed) before every network definition (plot): =====
The training of net_a is strange: the MSE is high at the start and doesn't decrease. In contrast, the training of net_b looks normal: the MSE is relatively low at first and drops to a smaller value after 100 iterations.
Does anyone know how to fix this? I would like to experiment with different numbers of layers, layer sizes, and activation functions, and I don't want to write out a separate network for every set of hyper-parameters.
Upvotes: 2
Views: 7310
Reputation: 4826
The random state is different after torch initializes the weights of the first network. To keep the same initialization, you need to reset the random state by calling torch.manual_seed(seed) after the definition of the first network and before the second one.
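A minimal sketch of that re-seeding idea (the layer sizes just mirror the question's; only the print at the end is the check):

import torch
import torch.nn as nn

seed = 0

torch.manual_seed(seed)   # reset the RNG state before the first definition
net_a = nn.Sequential(nn.Linear(7, 64), nn.Tanh(), nn.Linear(64, 8))

torch.manual_seed(seed)   # reset it again before the second definition
net_b = nn.Sequential(nn.Linear(7, 64), nn.Tanh(), nn.Linear(64, 8))

# Same RNG state at construction time -> identical initial weights
print(torch.equal(net_a[0].weight, net_b[0].weight))   # True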
The problem lies in net_x/y/z -- it would be perfectly fine if there were only net_x. When you pass modules to nn.Sequential, it does not create new modules; it stores references to the ones it is given. So in your first definition there is only one copy of the layers, which means net_x, net_y, and net_z all share the same weights. In your second definition each network has its own independent weights, which is what you actually want.
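You can check the sharing directly; a quick sketch (the layer sizes just mirror the question's):

layers = [nn.Linear(7, 64), nn.Tanh(), nn.Linear(64, 8)]
net_x = nn.Sequential(*layers)
net_y = nn.Sequential(*layers)

# Both Sequentials hold references to the very same Linear module,
# so a gradient step on net_x also changes net_y.
print(net_x[0] is net_y[0])   # True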
You could define it like this instead:
def get_net():
    layers = [nn.Linear(7, 64), nn.Tanh()]
    for i in range(layernum - 1):  # layernum = 3
        layers.append(nn.Linear(64, 64))
        layers.append(nn.Tanh())
    layers.append(nn.Linear(64, 8))
    return layers

net_x = nn.Sequential(*get_net())
net_y = nn.Sequential(*get_net())
net_z = nn.Sequential(*get_net())
Each time get_net is called, it builds a fresh list of layer modules, so net_x, net_y, and net_z no longer share weights.
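If you also want the three networks to start from identical initial weights (not just independent ones), you could re-seed before each call, as described above -- a sketch, assuming the same seed variable from the question:

torch.manual_seed(seed)
net_x = nn.Sequential(*get_net())
torch.manual_seed(seed)
net_y = nn.Sequential(*get_net())
torch.manual_seed(seed)
net_z = nn.Sequential(*get_net())

print(net_x[0] is net_y[0])                            # False: independent modules
print(torch.equal(net_x[0].weight, net_y[0].weight))   # True: identical initial values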
Upvotes: 3