Reputation: 81
I am learning the StyleGAN architecture and I got confused about the purpose of the Mapping Network. The original paper says:
Our mapping network consists of 8 fully-connected layers, and the dimensionality of all input and output activations — including z and w — is 512.
And there is no information about this network being trained in any way.
Like, wouldn’t it just generate some nonsense values?
I've tried creating a network like that (but with a smaller shape, (16,)):
import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(16,)))  # shape must be a tuple: (16,), not (16)
for i in range(8):  # the paper uses 8 fully-connected layers
    model.add(tf.keras.layers.Dense(16, activation='relu'))
model.compile()  # not strictly needed just to call the model
and then evaluated it on some random values:
g = tf.random.Generator.from_seed(34)
model(g.normal(shape=(16, 16)))
And I am getting some random outputs like:
array([[0. , 0.01045225, 0. , 0. , 0.02217731,
0.00940356, 0.02321716, 0.00556996, 0. , 0. ,
0. , 0.03117323, 0. , 0. , 0.00734158,
0. ],
[0.03159791, 0.05680077, 0. , 0. , 0. ,
0. , 0.05907414, 0. , 0. , 0. ,
0. , 0. , 0.03110216, 0.04647615, 0. ,
0.04566741],
       ...  # more similar vectors here
[0. , 0.01229661, 0.00056016, 0. , 0.03534952,
0.02654905, 0.03212402, 0. , 0. , 0. ,
0. , 0.0913604 , 0. , 0. , 0. ,
0. ]], dtype=float32)>
What am I missing? Is there any information on the Internet about training the Mapping Network? Any math explanation? Got really confused :(
Upvotes: 1
Views: 1863
Reputation: 2945
TL;DR: The mapping network aims to disentangle the feature space, and it can be either integrated into the generator or implemented as a separate network. For more detailed information, you can refer to the following blog post: Turning StyleGAN into a latent feature extractor.
A typical GAN takes random noise as input, denoted as z. We need the noise to get a different image each time we generate a new one; otherwise, we would always get the same image. Over time, the GAN learns to map each value in z to high-level features in the images. Usually, the noise is normally distributed (each value in z is drawn with zero mean and unit variance), as the assumption is that the latent features are normally distributed.
The idea is that the values in z represent abstract features, but the latent space Z becomes entangled because we model a complex, real distribution of features with a simple normal distribution. Feature space Z has to reproduce the real statistics of the features and, consequently, each value in z becomes a mix of multiple features (e.g., the features beard and male often appear together).
To solve this problem, StyleGAN introduces a mapping network. It transforms the entangled feature space Z into the disentangled feature space W. Here is an illustration of this problem:
[Illustration: entangled feature space Z vs. disentangled feature space W]
Let's denote the mapping network as F, the generator as G, and the discriminator as D. The full output is then D(G(F(z))). So, technically, we can train each network separately; the only thing we need is the error backpropagated from the network in front of it. For example, to update D we just need an input image, G needs gradients from D, and F needs gradients from G.
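To make this concrete, here is a minimal TensorFlow sketch of a single generator update. The three models are tiny stand-ins (not the real StyleGAN architectures), but the gradient flow is the same: one GradientTape over D(G(F(z))) gives F its gradients through G, so the mapping network trains jointly with the generator rather than with a separate objective.

import tensorflow as tf

# Tiny stand-in models for the mapping network F, generator G, and discriminator D.
F = tf.keras.Sequential([tf.keras.layers.Dense(512, activation='relu') for _ in range(8)])
G = tf.keras.Sequential([tf.keras.layers.Dense(64)])  # "synthesis", outputs a fake "image"
D = tf.keras.Sequential([tf.keras.layers.Dense(1)])   # outputs a realness logit

g_opt = tf.keras.optimizers.Adam(1e-4)

def generator_step(batch_size=8):
    z = tf.random.normal((batch_size, 512))
    with tf.GradientTape() as tape:
        w = F(z)          # mapping network: z -> w
        fake = G(w)       # synthesis network: w -> image
        logits = D(fake)  # discriminator score
        loss = tf.reduce_mean(tf.nn.softplus(-logits))  # non-saturating generator loss
    # A single tape over D(G(F(z))): F receives its gradients through G automatically.
    variables = F.trainable_variables + G.trainable_variables
    g_opt.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss

The discriminator gets its own, separate step in which only D's variables are updated; F never needs a loss of its own beyond the GAN loss.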
Here is an illustration for D(G(z)):
[Illustration: backpropagating through D(G(z))]
Upvotes: 1
Reputation: 116
I'm going to try a visual explanation of the "disentanglement" concept in the context of the mapping network in StyleGAN.
Setting
In the figure below, let's consider the task of generative modeling of human faces. Here, we have the prior latent space z and the learned posterior w (the terms prior and posterior are not strictly accurate here). We also consider two "factors" relevant to human faces, i.e., hair and eyes.
[Figure: latent spaces z and w, each containing hair and eye factor subspaces]
Explanation
In z, we see that the subspaces of the hair and eye factors are mixed, while in w they are "disentangled". Since there exists a non-linear mapping (i.e., the fully-connected layers) between z and w, sampling a point in z gives us a point in w. The difference is that the point in w now encodes one specific factor, whereas the point in z (possibly) encodes two or more factors.
This disentanglement gives us smoother control as we traverse the latent space. Hence, the images produced by such a traversal have gradual variations that are more understandable to a human observer; the toy sketch below illustrates this.
Update: On second thought, the feature subspaces in w would be more like orthogonal lines (in contrast to the blobby spaces shown in my diagram). But an interesting aspect to think about is how the mapping network does this without having any explicit supervision for disentanglement. Surprisingly, the regular GAN gradients are already able to create such a feature space.
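On the traversal point: here is a toy, purely illustrative numpy sketch. The weights are random stand-ins for a trained mapping network, but they show that a straight line in z does not map to a straight line in w, which is why smooth traversals are done directly in w.

import numpy as np

rng = np.random.default_rng(0)

def mapping(z, layers):
    # Toy stand-in for the mapping MLP (random, untrained weights).
    w = z
    for W_l in layers:
        h = w @ W_l
        w = np.maximum(h, 0.2 * h)  # leaky ReLU
    return w

layers = [rng.standard_normal((16, 16)) / 4 for _ in range(8)]
z0, z1 = rng.standard_normal(16), rng.standard_normal(16)

for t in np.linspace(0.0, 1.0, 5):
    w_via_z = mapping((1 - t) * z0 + t * z1, layers)                    # line in z, then map
    w_direct = (1 - t) * mapping(z0, layers) + t * mapping(z1, layers)  # line in w
    print(f"t={t:.2f}  ||difference||={np.linalg.norm(w_via_z - w_direct):.4f}")

At t = 0 and t = 1 the two agree exactly; in between they diverge, because the non-linear mapping bends the line.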
Upvotes: 1
Reputation: 4258
As I understand it, the mapping network is not trained separately. It is part of the generator network and adjusts its weights based on gradients just like the other parts of the network.
In their stylegan generator code implementation it is written that the Generator is composed of two sub-networks, one mapping and one synthesis. In the stylegan3 generator source it is much easier to see: the output of the mapping network is passed to the synthesis network, which generates the image.
class Generator(torch.nn.Module):
    ...
    def forward(self, z, ...):
        ws = self.mapping(z, ...)
        img = self.synthesis(ws, ...)
        return img
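For reference, generating an image from a pretrained checkpoint follows the same pattern. This is only a sketch of the usage described in the stylegan3 repo: it assumes a GPU machine, the repo on your Python path, and a downloaded .pkl file (the file name below is just a placeholder).

import pickle
import torch

with open('stylegan3-ffhq.pkl', 'rb') as f:  # placeholder file name
    G = pickle.load(f)['G_ema'].cuda()       # the pickle stores the full Generator

z = torch.randn([1, G.z_dim]).cuda()  # noise vector z
c = None                              # class labels (unused for unconditional models)
img = G(z, c)                         # runs mapping + synthesis internally

You can also call G.mapping and G.synthesis separately, exactly as in the forward method above.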
The diagram below shows the mapping network from the StyleGAN 2019 paper; Section 2 of the paper describes it.
[Diagram: the style-based generator with its mapping network f, from the StyleGAN paper]
The mapping network is denoted f in the paper; it takes a noise vector z, sampled from a normal distribution, and maps it to an intermediate latent representation w. It is implemented as an 8-layer MLP, and the Stylegan mapping network implementation accordingly has the number of MLP layers set to 8.
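A simplified PyTorch sketch of such an 8-layer MLP could look as follows. This is only an approximation: the official implementation also uses equalized learning rate, a reduced learning-rate multiplier, and optional label conditioning, all omitted here.

import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    # Simplified sketch of StyleGAN's mapping MLP f: Z -> W.
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers = []
        dims = [z_dim] + [w_dim] * num_layers
        for i in range(num_layers):
            layers += [nn.Linear(dims[i], dims[i + 1]), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # The official code normalizes z before the MLP.
        z = z * torch.rsqrt(torch.mean(z ** 2, dim=1, keepdim=True) + 1e-8)
        return self.net(z)

f = MappingNetwork()
w = f(torch.randn(4, 512))  # w has the same shape as z: (4, 512)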
In Section 4 they mention:
a common goal is a latent space that consists of linear subspaces, each of which controls one factor of variation. However, the sampling probability of each combination of factors in Z needs to match the corresponding density in the training data. A major benefit of our generator architecture is that the intermediate latent space W does not have to support sampling according to any fixed distribution.
So, z and w have the same dimensions, but w is more disentangled than z. Finding a w in the intermediate latent space W for a given image allows specific edits to that image.
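A rough sketch of that idea: optimize a single w so the synthesis network reconstructs a target image. Here, synthesis and target are assumed to already exist; real projectors, like the one in the official repo, also add a perceptual LPIPS loss and noise regularization.

import torch

def project(synthesis, target, w_init, steps=500, lr=0.01):
    # Optimize a latent w so that synthesis(w) reconstructs `target`.
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = synthesis(w)
        loss = torch.nn.functional.mse_loss(img, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()  # move this w along semantic directions to edit the image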
From the Encoder for Editing paper:
[Figure from the Encoder for Editing paper]
In the stylegan2-ada paper, among other changes, they found a mapping network depth of 2 to work better than 8. In the stylegan3 mapping layer code implementation, the default number of layers in the mapping network is set to 2.
Upvotes: 4