Why are the outputs from PyTorch and TensorFlow GRU layers not equivalent?

Question

I'm trying to port a trained Multi-cell GRU model in TensorFlow 1.x to PyTorch because I want to combine the encoder with some other, more advanced PyTorch modules. I managed to extract the weights to a separate NumPy array per layer. However, when I recreate the model in PyTorch and manually set the weights, my output for some test examples is not identical to the original model. The dimensions of all tensors are correct, but the numerical values are vastly different.

This is the original tf model (activation in the final dense layer is tanh):

encoder_cell = [tf.nn.rnn_cell.GRUCell(size) for size in self.cell_size]
        encoder_cell = tf.contrib.rnn.MultiRNNCell(encoder_cell)
        encoder_outputs, encoder_state = tf.nn.dynamic_rnn(encoder_cell,
                                                           encoder_emb_inp,
                                                           sequence_length=self.input_len,
                                                           dtype=tf.float32,
                                                           time_major=False)
emb = tf.layers.dense(tf.concat(encoder_state, axis=1),
                              self.embedding_size
                             )
emb = self.emb_activation(emb)

return emb

This is my recreated PyTorch model:

class GRU_Encoder(nn.Module):
    def __init__(self, input_size=40, embedding_size=32, hidden_size=512):
        super(GRU_Encoder, self).__init__()
        
        self.weights_dir = '/media/drives/drive1/robin/cddd/default_model/weights'
        self.embedding_sizes = [32, 512, 1024, 2048]
        
        # Initial embedding layer
        char_weights = np.load(f'{self.weights_dir}/char_embedding_0.npy')
        self.char_projection = nn.Linear(input_size, embedding_size, bias=False)
        self.char_projection.weight.data = torch.FloatTensor(char_weights.T)
        
        # Create 3 GRU layers
        self.gru_layers = nn.ModuleList([
            nn.GRU(
                input_size=self.embedding_sizes[i],
                hidden_size=self.embedding_sizes[i+1],
                batch_first=True
            ) for i in range(3)
        ])
        
        # Final dense layer
        dense_kernel = np.load(f'{self.weights_dir}/Encoder_dense_kernel_0.npy')
        self.dense = nn.Linear(sum(self.embedding_sizes[1:]), dense_kernel.shape[1])
        self.tanh = nn.Tanh()
        
        self._load_gru_weights()
        self._load_dense_weights()
    
    def _load_gru_weights(self):
        for i in range(3):
            # Load weights
            gates_kernel = np.load(f'{self.weights_dir}/Encoder_rnn_multi_rnn_cell_cell_{i}_gru_cell_gates_kernel_0.npy')
            gates_bias = np.load(f'{self.weights_dir}/Encoder_rnn_multi_rnn_cell_cell_{i}_gru_cell_gates_bias_0.npy')
            candidate_kernel = np.load(f'{self.weights_dir}/Encoder_rnn_multi_rnn_cell_cell_{i}_gru_cell_candidate_kernel_0.npy')
            candidate_bias = np.load(f'{self.weights_dir}/Encoder_rnn_multi_rnn_cell_cell_{i}_gru_cell_candidate_bias_0.npy')
            
            input_size = self.embedding_sizes[i]
            hidden_size = self.embedding_sizes[i+1]
            
            # Properly reshape weights for PyTorch GRU format
            # PyTorch expects (3 * hidden_size, input_size) for ih weights
            # and (3 * hidden_size, hidden_size) for hh weights
            
            # Split input and hidden weights
            gates_kernel_i = gates_kernel[:input_size, :]  # Input weights
            print(gates_kernel_i.shape)
            gates_kernel_h = gates_kernel[input_size:, :]  # Hidden weights
            print(gates_kernel_h.shape)
            candidate_kernel_i = candidate_kernel[:input_size, :]
            candidate_kernel_h = candidate_kernel[input_size:, :]
            
            # Combine weights in PyTorch's expected format
            w_ih = np.concatenate([
                gates_kernel_i[:, :hidden_size],     # reset gate
                gates_kernel_i[:, hidden_size:],     # update gate
                candidate_kernel_i
            ], axis=1)
            
            w_hh = np.concatenate([
                gates_kernel_h[:, :hidden_size],     # reset gate
                gates_kernel_h[:, hidden_size:],     # update gate
                candidate_kernel_h
            ], axis=1)
            
            # Combine biases
            b_ih = np.concatenate([
                gates_bias[:hidden_size],            # reset gate
                gates_bias[hidden_size:],            # update gate
                candidate_bias
            ])
            
            print('You are here')
            print(self.gru_layers[i].bias_ih_l0.shape)
            print(self.gru_layers[i].bias_hh_l0.shape)
            # Set weights and biases
            self.gru_layers[i].weight_ih_l0.data = torch.FloatTensor(w_ih.T)
            self.gru_layers[i].weight_hh_l0.data = torch.FloatTensor(w_hh.T)
            self.gru_layers[i].bias_ih_l0.data = torch.FloatTensor(b_ih)
            self.gru_layers[i].bias_hh_l0.data = torch.zeros_like(self.gru_layers[i].bias_hh_l0)
    
    def _load_dense_weights(self):
        dense_kernel = np.load(f'{self.weights_dir}/Encoder_dense_kernel_0.npy')
        dense_bias = np.load(f'{self.weights_dir}/Encoder_dense_bias_0.npy')
        self.dense.weight.data = torch.FloatTensor(dense_kernel.T)
        self.dense.bias.data = torch.FloatTensor(dense_bias)
    
    def forward(self, x):
        # Initial projection
        x = self.char_projection(x)
        # Process through GRU layers
        hidden_states = []
        current_input = x
        
        for gru in self.gru_layers:
            output, hidden = gru(current_input)
            print(f' Output: {output.shape}')
            print(f' Hidden: {hidden.shape}')
            output_array = np.array(output)
            print(np.array(hidden[-1]))
            hidden_states.append(hidden[-1])
            current_input = output
        
        # Combine hidden states and apply final transformation
        combined_hidden = torch.cat(hidden_states, dim=1)
        x = self.dense(combined_hidden)
        x = self.tanh(x)
        return x

After self.char_projection, x is identical to encoder_emb_inp from the original tf model, so up to that point everything works. I'm really losing my mind over why these two models are not identical

EDIT

So I did a small test by creating a single-layer GRU in TensorFlow with random weights, and then extracting those weights to use them in an equivalent PyTorch GRU layer. Then I ran the same random input sequence through the models.

This is the output from tf:

[[ 0.05467869 0.33968428 0.2105151 -0.12761241 0.21320477 0.10551707 -0.50173706 -0.6246785 -0.15300453 -0.41065386 0.68777466 -0.09512346 -0.19455332 0.16397609 -0.39829862 -0.08076846 -0.5778191 0.6005554 -0.21811792 -0.40445453 0.14882612 -0.15725932 0.78335345 -0.01807535 -0.11371396 0.36048588 0.05701765 0.07090355 -0.00806656 0.23904625 0.15240604 0.16763008 -0.10185873 -0.05101678 0.2543993 0.15713784 -0.3960471 0.19644792 0.41446018 0.06870119 0.6141467 0.04652876 -0.18108694 0.0198893 -0.06038892 0.08714387 -0.61984295 -0.53614116 0.3603348 -0.5454426 0.05955757 0.19048552 0.35636324 0.41100442 0.02487492 -0.094344 0.09468287 -0.3476482 -0.25992087 -0.3641351 -0.39407793 0.07722727 -0.18467858 -0.23657729]]

And this is the output from PyTorch:

tensor([[ 0.0468, 0.3595, 0.2212, -0.1979, 0.1873, 0.0779, -0.4928, -0.6284, -0.2067, -0.4314, 0.6834, -0.0977, -0.2465, 0.1701, -0.4250, -0.0407, -0.5658, 0.5969, -0.2398, -0.3759, 0.1890, -0.1583, 0.7863, -0.0432, -0.1087, 0.3510, 0.0330, 0.1209, -0.0116, 0.2452, 0.1448, 0.1509, -0.1180, -0.0370, 0.2429, 0.1955, -0.4588, 0.2224, 0.4183, 0.0875, 0.6251, 0.0623, -0.1790, 0.0036, 0.0028, 0.1018, -0.6116, -0.5645, 0.3532, -0.5335, 0.0118, 0.1614, 0.3730, 0.4103, -0.0311, -0.1103, 0.0696, -0.3656, -0.2722, -0.3591, -0.4045, 0.1104, -0.1659, -0.2565]])

Both vectors are clearly similar, but there is quite a substantial numerical difference between the two methods.

Does anyone know the origin behind this difference, and is there any solution to make PyTorch give the same output as TensorFlow?

EDIT V2

The result seems to be entirely due to the order of matrix multiplication in the step where h^ is calculated: https://pytorch.org/docs/stable/generated/torch.nn.GRU.html

TensorFlow 1.x uses the original implementation, but PyTorch does not allow for switching to this. I will try to make a custom GRU PyTorch implementation to mimic this behavior, since as far as I'm aware this does not exist yet.

Why are the outputs from PyTorch and TensorFlow GRU layers not equivalent?

Answers (1)

Related Questions