Training loss optimisation to reduce blurriness in a GAN-based mouth generation model

Question

I need help training a GAN model for mouth-area inpainting (with audio and reference frames for lip syncing).

At the moment it gives good results for the lip shapes and for filling in the cheeks, but the mouth is quite blurry overall, especially the teeth.

The current dataset consists of 1500+ high-quality videos, each 20 s to 45 s long, with a different person in each video.

The current losses:

GAN adversarial loss for each frame.

VGG (perceptual) loss.
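For reference, the two losses listed above are commonly combined into a single generator objective. The following is only a minimal sketch, assuming PyTorch; the non-saturating BCE form of the adversarial term, the weight `lambda_perc`, and the names `generator_loss` and `feature_net` are illustrative assumptions, not details taken from the actual training code. `feature_net` stands in for a frozen VGG feature extractor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def generator_loss(fake_logits, pred, target, feature_net, lambda_perc=10.0):
    """Adversarial term plus a feature-space (perceptual) L1 term.

    fake_logits: discriminator outputs for the generated frames.
    pred, target: generated and ground-truth frames, shape (N, C, H, W).
    feature_net: frozen feature extractor (hypothetical stand-in for VGG).
    lambda_perc: assumed weighting, to be tuned.
    """
    # Adversarial term: the generator wants the discriminator to say "real" (1).
    adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits)
    )
    # Perceptual term: compare features instead of raw pixels.
    # Target features need no gradient.
    with torch.no_grad():
        target_feat = feature_net(target)
    perc = F.l1_loss(feature_net(pred), target_feat)
    return adv + lambda_perc * perc, adv, perc
```

In practice `feature_net` would be a truncated, frozen VGG feature stack with its parameters excluded from the optimiser. Returning the individual terms makes it easier to log them and see which one dominates.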

Model input:

Image sequence with masked mouth area.

Wav2vec latent vector from the audio.

5 random reference frames of the face, masked so that only the mouth area is visible.
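To make the two kinds of masking concrete, here is a small sketch, assuming PyTorch tensors of shape (T, C, H, W) and a rectangular mouth region. The rectangle, the `box` format and the function names are assumptions for illustration only; the real pipeline may derive the mask differently (for example from landmarks).

```python
import torch


def mouth_mask(height, width, box):
    """Binary mask of shape (1, 1, H, W): 1 inside the mouth box, 0 elsewhere.

    box = (top, left, box_height, box_width) in pixels (hypothetical format).
    """
    top, left, h, w = box
    mask = torch.zeros(1, 1, height, width)
    mask[..., top:top + h, left:left + w] = 1.0
    return mask


def hide_mouth(frames, mask):
    """Input sequence: zero out the mouth region, keep the rest of the face."""
    return frames * (1.0 - mask)


def keep_mouth(frames, mask):
    """Reference frames: keep only the mouth region, zero out everything else."""
    return frames * mask
```

The mask broadcasts over the time and channel dimensions, so the same function works for the 4-frame input sequence and for the 5 reference frames.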

I was wondering whether I am missing something in the training: a type of loss, or a method that would improve the result?

I tried a landmark-based loss, but it slows down the training a lot.

I was thinking of training a second model for enhancing/upscaling the output of the first one.

What do you think?

The examples below are the results after fine-tuning for about 120 iterations (of about 4 frames each), without the Sobel loss.

What I'm trying:

Adding an MSE loss between the Sobel-filtered predictions and the Sobel-filtered source frames. It seems to slightly decrease the blurriness and gives a better representation of the teeth.

Fine-tuning the model on a specific face for a few iterations.
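A minimal sketch of the Sobel MSE loss mentioned above, assuming PyTorch and image batches of shape (N, C, H, W). The function names and the use of the gradient magnitude (rather than separate horizontal and vertical responses) are assumptions for illustration, not necessarily what the actual code does.

```python
import torch
import torch.nn.functional as F


def sobel_edges(img):
    """Per-channel Sobel gradient magnitude for an (N, C, H, W) batch."""
    kx = torch.tensor(
        [[-1.0, 0.0, 1.0],
         [-2.0, 0.0, 2.0],
         [-1.0, 0.0, 1.0]],
        dtype=img.dtype, device=img.device,
    )
    ky = kx.t()
    c = img.shape[1]
    # One copy of each kernel per channel; groups=c filters each channel
    # independently (depthwise convolution).
    wx = kx.expand(c, 1, 3, 3).contiguous()
    wy = ky.expand(c, 1, 3, 3).contiguous()
    gx = F.conv2d(img, wx, padding=1, groups=c)
    gy = F.conv2d(img, wy, padding=1, groups=c)
    # The small epsilon keeps the sqrt gradient finite in flat regions.
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)


def sobel_mse_loss(pred, target):
    """MSE between the edge maps of the prediction and the source frame."""
    return F.mse_loss(sobel_edges(pred), sobel_edges(target))
```

This term would be added to the generator loss with its own weight. Since it only penalises differences in edge strength, it complements the adversarial and VGG losses rather than replacing them.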
