Reputation: 11
I'm trying to port a trained multi-cell GRU model from TensorFlow 1.x to PyTorch because I want to combine the encoder with some other, more advanced PyTorch modules. I managed to extract the weights into a separate NumPy array per layer. However, when I recreate the model in PyTorch and manually set the weights, the output for my test examples is not identical to the original model's. The dimensions of all tensors are correct, but the numerical values differ substantially.
This is the original tf model (activation in the final dense layer is tanh):
encoder_cell = [tf.nn.rnn_cell.GRUCell(size) for size in self.cell_size]
encoder_cell = tf.contrib.rnn.MultiRNNCell(encoder_cell)
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(encoder_cell,
                                                   encoder_emb_inp,
                                                   sequence_length=self.input_len,
                                                   dtype=tf.float32,
                                                   time_major=False)
emb = tf.layers.dense(tf.concat(encoder_state, axis=1),
                      self.embedding_size)
emb = self.emb_activation(emb)
return emb
This is my recreated PyTorch model:
import numpy as np
import torch
import torch.nn as nn

class GRU_Encoder(nn.Module):
    def __init__(self, input_size=40, embedding_size=32, hidden_size=512):
        super(GRU_Encoder, self).__init__()
        self.weights_dir = '/media/drives/drive1/robin/cddd/default_model/weights'
        self.embedding_sizes = [32, 512, 1024, 2048]

        # Initial embedding layer
        char_weights = np.load(f'{self.weights_dir}/char_embedding_0.npy')
        self.char_projection = nn.Linear(input_size, embedding_size, bias=False)
        self.char_projection.weight.data = torch.FloatTensor(char_weights.T)

        # Three stacked GRU layers with increasing hidden sizes
        self.gru_layers = nn.ModuleList([
            nn.GRU(
                input_size=self.embedding_sizes[i],
                hidden_size=self.embedding_sizes[i + 1],
                batch_first=True
            ) for i in range(3)
        ])

        # Final dense layer
        dense_kernel = np.load(f'{self.weights_dir}/Encoder_dense_kernel_0.npy')
        self.dense = nn.Linear(sum(self.embedding_sizes[1:]), dense_kernel.shape[1])
        self.tanh = nn.Tanh()

        self._load_gru_weights()
        self._load_dense_weights()

    def _load_gru_weights(self):
        for i in range(3):
            # Load the TF checkpoint arrays for this cell
            gates_kernel = np.load(f'{self.weights_dir}/Encoder_rnn_multi_rnn_cell_cell_{i}_gru_cell_gates_kernel_0.npy')
            gates_bias = np.load(f'{self.weights_dir}/Encoder_rnn_multi_rnn_cell_cell_{i}_gru_cell_gates_bias_0.npy')
            candidate_kernel = np.load(f'{self.weights_dir}/Encoder_rnn_multi_rnn_cell_cell_{i}_gru_cell_candidate_kernel_0.npy')
            candidate_bias = np.load(f'{self.weights_dir}/Encoder_rnn_multi_rnn_cell_cell_{i}_gru_cell_candidate_bias_0.npy')

            input_size = self.embedding_sizes[i]
            hidden_size = self.embedding_sizes[i + 1]

            # TF stacks input and hidden weights into a single kernel of shape
            # (input_size + hidden_size, 2 * hidden_size) for the gates;
            # PyTorch expects weight_ih of shape (3 * hidden_size, input_size)
            # and weight_hh of shape (3 * hidden_size, hidden_size).
            gates_kernel_i = gates_kernel[:input_size, :]    # input weights
            gates_kernel_h = gates_kernel[input_size:, :]    # hidden weights
            candidate_kernel_i = candidate_kernel[:input_size, :]
            candidate_kernel_h = candidate_kernel[input_size:, :]

            # Reorder into PyTorch's (reset, update, candidate) gate layout
            w_ih = np.concatenate([
                gates_kernel_i[:, :hidden_size],    # reset gate
                gates_kernel_i[:, hidden_size:],    # update gate
                candidate_kernel_i
            ], axis=1)
            w_hh = np.concatenate([
                gates_kernel_h[:, :hidden_size],    # reset gate
                gates_kernel_h[:, hidden_size:],    # update gate
                candidate_kernel_h
            ], axis=1)

            # TF has a single bias per gate; map it to bias_ih and zero out bias_hh
            b_ih = np.concatenate([
                gates_bias[:hidden_size],    # reset gate
                gates_bias[hidden_size:],    # update gate
                candidate_bias
            ])

            self.gru_layers[i].weight_ih_l0.data = torch.FloatTensor(w_ih.T)
            self.gru_layers[i].weight_hh_l0.data = torch.FloatTensor(w_hh.T)
            self.gru_layers[i].bias_ih_l0.data = torch.FloatTensor(b_ih)
            self.gru_layers[i].bias_hh_l0.data = torch.zeros_like(self.gru_layers[i].bias_hh_l0)

    def _load_dense_weights(self):
        dense_kernel = np.load(f'{self.weights_dir}/Encoder_dense_kernel_0.npy')
        dense_bias = np.load(f'{self.weights_dir}/Encoder_dense_bias_0.npy')
        self.dense.weight.data = torch.FloatTensor(dense_kernel.T)
        self.dense.bias.data = torch.FloatTensor(dense_bias)

    def forward(self, x):
        # Initial projection
        x = self.char_projection(x)

        # Run through the stacked GRU layers, keeping each layer's final hidden state
        hidden_states = []
        current_input = x
        for gru in self.gru_layers:
            output, hidden = gru(current_input)
            hidden_states.append(hidden[-1])
            current_input = output

        # Concatenate the per-layer hidden states and apply the final dense + tanh
        combined_hidden = torch.cat(hidden_states, dim=1)
        x = self.dense(combined_hidden)
        x = self.tanh(x)
        return x
After self.char_projection, x is identical to encoder_emb_inp from the original TF model, so everything works up to that point. I'm really losing my mind over why these two models are not identical.
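For reference, the comparison itself is roughly this helper (the name and tolerances are just my own choices):

import numpy as np

def assert_close(tf_array, torch_tensor, rtol=1e-5, atol=1e-6):
    """Compare an array fetched from the TF session with a PyTorch tensor
    computed from the same input."""
    np.testing.assert_allclose(tf_array,
                               torch_tensor.detach().cpu().numpy(),
                               rtol=rtol, atol=atol)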
EDIT
So I did a small test: I created a single-layer GRU in TensorFlow with random weights, extracted those weights into an equivalent PyTorch GRU layer, and ran the same random input sequence through both models.
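The test looked roughly like this (a sketch; in particular I'm assuming the ordering that cell.variables returns, i.e. gates kernel/bias followed by candidate kernel/bias):

import numpy as np
import tensorflow as tf  # TF 1.x
import torch
import torch.nn as nn

input_size, hidden_size, seq_len = 32, 64, 10
x = np.random.randn(1, seq_len, input_size).astype(np.float32)

# --- TensorFlow 1.x side ---
tf.reset_default_graph()
inp = tf.placeholder(tf.float32, [None, seq_len, input_size])
cell = tf.nn.rnn_cell.GRUCell(hidden_size)
_, state = tf.nn.dynamic_rnn(cell, inp, dtype=tf.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf_state, tf_weights = sess.run([state, cell.variables], {inp: x})
# Assumed order: gates kernel, gates bias, candidate kernel, candidate bias
gates_kernel, gates_bias, cand_kernel, cand_bias = tf_weights

# --- PyTorch side ---
# Split the stacked TF kernels into input/hidden parts and reorder to
# PyTorch's (reset, update, candidate) gate layout
w_ih = np.concatenate([gates_kernel[:input_size, :hidden_size],
                       gates_kernel[:input_size, hidden_size:],
                       cand_kernel[:input_size, :]], axis=1)
w_hh = np.concatenate([gates_kernel[input_size:, :hidden_size],
                       gates_kernel[input_size:, hidden_size:],
                       cand_kernel[input_size:, :]], axis=1)
b_ih = np.concatenate([gates_bias[:hidden_size],
                       gates_bias[hidden_size:],
                       cand_bias])

gru = nn.GRU(input_size, hidden_size, batch_first=True)
with torch.no_grad():
    gru.weight_ih_l0.copy_(torch.from_numpy(w_ih.T))
    gru.weight_hh_l0.copy_(torch.from_numpy(w_hh.T))
    gru.bias_ih_l0.copy_(torch.from_numpy(b_ih))
    gru.bias_hh_l0.zero_()
    _, torch_state = gru(torch.from_numpy(x))

print(tf_state)         # final hidden state from TF
print(torch_state[-1])  # final hidden state from PyTorch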
This is the output from tf:
[[ 0.05467869 0.33968428 0.2105151 -0.12761241 0.21320477 0.10551707 -0.50173706 -0.6246785 -0.15300453 -0.41065386 0.68777466 -0.09512346 -0.19455332 0.16397609 -0.39829862 -0.08076846 -0.5778191 0.6005554 -0.21811792 -0.40445453 0.14882612 -0.15725932 0.78335345 -0.01807535 -0.11371396 0.36048588 0.05701765 0.07090355 -0.00806656 0.23904625 0.15240604 0.16763008 -0.10185873 -0.05101678 0.2543993 0.15713784 -0.3960471 0.19644792 0.41446018 0.06870119 0.6141467 0.04652876 -0.18108694 0.0198893 -0.06038892 0.08714387 -0.61984295 -0.53614116 0.3603348 -0.5454426 0.05955757 0.19048552 0.35636324 0.41100442 0.02487492 -0.094344 0.09468287 -0.3476482 -0.25992087 -0.3641351 -0.39407793 0.07722727 -0.18467858 -0.23657729]]
And this is the output from PyTorch:
tensor([[ 0.0468, 0.3595, 0.2212, -0.1979, 0.1873, 0.0779, -0.4928, -0.6284, -0.2067, -0.4314, 0.6834, -0.0977, -0.2465, 0.1701, -0.4250, -0.0407, -0.5658, 0.5969, -0.2398, -0.3759, 0.1890, -0.1583, 0.7863, -0.0432, -0.1087, 0.3510, 0.0330, 0.1209, -0.0116, 0.2452, 0.1448, 0.1509, -0.1180, -0.0370, 0.2429, 0.1955, -0.4588, 0.2224, 0.4183, 0.0875, 0.6251, 0.0623, -0.1790, 0.0036, 0.0028, 0.1018, -0.6116, -0.5645, 0.3532, -0.5335, 0.0118, 0.1614, 0.3730, 0.4103, -0.0311, -0.1103, 0.0696, -0.3656, -0.2722, -0.3591, -0.4045, 0.1104, -0.1659, -0.2565]])
Both vectors are clearly similar, but there is quite a substantial numerical difference between the two methods.
Does anyone know the origin behind this difference, and is there any solution to make PyTorch give the same output as TensorFlow?
EDIT V2
The difference turns out to come entirely from where the reset gate is applied when the candidate state is computed; see the note in the PyTorch GRU docs: https://pytorch.org/docs/stable/generated/torch.nn.GRU.html
TF 1.x follows the original paper and applies the reset gate to the previous hidden state before the matrix multiplication:
n_t = tanh(W_in x_t + b_in + W_hn (r_t * h_{t-1}) + b_hn)
whereas PyTorch applies it after:
n_t = tanh(W_in x_t + b_in + r_t * (W_hn h_{t-1} + b_hn))
PyTorch does not allow switching to the original formulation, so I will try to write a custom GRU implementation in PyTorch to mimic this behavior, since as far as I'm aware one does not exist yet.
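A quick way to see that the two formulations genuinely diverge, keeping everything identical except the placement of the reset gate (the input part is precomputed for brevity):

import torch

torch.manual_seed(0)
hidden = 4
h = torch.randn(1, hidden)                  # previous hidden state
x_part = torch.randn(1, hidden)             # stands in for W_in x + b_in
W_hn = torch.randn(hidden, hidden)
b_hn = torch.randn(hidden)
r = torch.sigmoid(torch.randn(1, hidden))   # reset gate

n_tf = torch.tanh(x_part + (r * h) @ W_hn + b_hn)  # TF 1.x / original paper
n_pt = torch.tanh(x_part + r * (h @ W_hn + b_hn))  # PyTorch
print(torch.allclose(n_tf, n_pt))  # False in general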
Upvotes: 0
Views: 38
Reputation: 11
After a few days of work, I was able to fully mimic the behavior of TF 1.x in my PyTorch model. I created a CustomGRUCell that applies the reset gate to the previous hidden state before the matrix multiplication when computing the new candidate tensor, as in the original formulation. See the note in the PyTorch docs for clarification: https://pytorch.org/docs/stable/generated/torch.nn.GRU.html
This CustomGRUCell is used in a custom multi-layer GRU (which allows layers with different hidden sizes, another feature PyTorch's built-in nn.GRU does not offer), into which I then loaded the weights copied from the original TensorFlow model. A simplified sketch of the cell follows.
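The core of the cell looks roughly like this (a simplified sketch using TF-style stacked kernels; the full version on GitHub also handles sequences, multiple layers, and loading the checkpoint weights):

import torch
import torch.nn as nn

class CustomGRUCell(nn.Module):
    """GRU cell that applies the reset gate to h before the hidden matmul,
    matching TF 1.x / the original Cho et al. formulation."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Same parameter shapes as TF's GRUCell: one stacked gates kernel
        # and one candidate kernel, each acting on [x, h]
        self.gates_kernel = nn.Parameter(torch.empty(input_size + hidden_size, 2 * hidden_size))
        self.gates_bias = nn.Parameter(torch.zeros(2 * hidden_size))
        self.candidate_kernel = nn.Parameter(torch.empty(input_size + hidden_size, hidden_size))
        self.candidate_bias = nn.Parameter(torch.zeros(hidden_size))
        nn.init.xavier_uniform_(self.gates_kernel)
        nn.init.xavier_uniform_(self.candidate_kernel)

    def forward(self, x, h):
        # Gates: sigmoid([x, h] @ kernel + bias), split into reset and update
        gates = torch.sigmoid(torch.cat([x, h], dim=1) @ self.gates_kernel + self.gates_bias)
        r, z = gates.chunk(2, dim=1)
        # Candidate: the reset gate is applied to h BEFORE the matmul,
        # unlike nn.GRUCell, which applies it after
        n = torch.tanh(torch.cat([x, r * h], dim=1) @ self.candidate_kernel + self.candidate_bias)
        # Same interpolation as both TF and PyTorch use
        return z * h + (1 - z) * n

The multi-layer wrapper then just loops this cell over the time steps of each layer, feeding one layer's output sequence into the next.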
For anyone interested in the solution, see the full code on my GitHub: https://github.com/robin-poelmans/CDDD_torch/tree/main
Upvotes: 1