Farshid Rayhan

Face alignment in PyTorch

I am trying to do face alignment on the 300W dataset. I am using ResNet50 and L1 loss for training. My code looks like this:

import sys

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision

batch_size = 10
image_size = 128

net = torchvision.models.resnet50(pretrained=True)
num_ftrs = net.fc.in_features
net.fc = nn.Linear(num_ftrs, 136)  # 136 because 68 points with 2 dims, so 136 = 68*2

def train():
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    net.to(device)
    net.train()

    optimiser = optim.Adam(net.parameters(), lr=0.001, weight_decay=0.0005)

    criterion = nn.L1Loss(reduction='sum')

    for epoch in range(200000):
        running_loss = 0.0
        for batch, data in enumerate(trainloader, 0):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            optimiser.zero_grad()

            # reshape the 136 outputs to (batch, 68, 2) to match the labels
            outputs = net(inputs).reshape(-1, 68, 2)

            loss = criterion(outputs, labels)
            loss.backward()
            optimiser.step()
            running_loss += loss.item()

            sys.stdout.write(
                '\rTrain Epoch: {} Batch {} avg_Loss_per_batch: {:.2f}'.format(
                    epoch, batch, running_loss / (batch + 1)))
            sys.stdout.flush()

The trainloader yields images and landmark points. The ground truths are shaped as (batch, 68, 2): there are 68 points on the face in 2-dimensional space.
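For context, a minimal sketch of a dataset that yields pairs in this shape could look like the code below. The class name LandmarkDataset, its arguments, and the loading details are assumptions, since the actual trainloader code is not shown in the question.

import cv2
import torch
from torch.utils.data import Dataset, DataLoader

class LandmarkDataset(Dataset):
    # Hypothetical dataset: yields (image tensor, landmarks of shape (68, 2)).
    def __init__(self, image_paths, landmarks):
        self.image_paths = image_paths   # list of N image file paths
        self.landmarks = landmarks       # array of shape (N, 68, 2), pixel coords

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx])
        img = cv2.normalize(img, None, alpha=0, beta=1,
                            norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)
        img = torch.from_numpy(img).permute(2, 0, 1)  # HWC -> CHW float tensor
        pts = torch.as_tensor(self.landmarks[idx], dtype=torch.float32)
        return img, pts

# trainloader = DataLoader(LandmarkDataset(train_paths, train_pts),
#                          batch_size=batch_size, shuffle=True)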

The paper suggests that ResNet50 should reach an error of about 10 pixels on a 256*256 image with L1 loss. I am getting errors around 500-800 on the validation set even after 5000 epochs.
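For reference, a per-landmark pixel error is usually computed as the mean Euclidean distance between predicted and ground-truth points; a minimal sketch is below. Whether this matches the exact metric reported in the paper is an assumption.

import torch

def mean_pixel_error(pred, target):
    # pred, target: tensors of shape (batch, 68, 2) in pixel coordinates.
    # Euclidean distance of each predicted point from its ground truth: (batch, 68)
    dists = torch.norm(pred - target, dim=2)
    return dists.mean().item()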

I am training on images with 256*256 resolution and a ground truth of 68 points, ((x1,y1),(x2,y2),...,(x68,y68)), and I have trained for over 5000 epochs with many learning rates. My validation code looks like this:

def validater(load_weights=False):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    net.eval()
    net.to(device)

    criterion = nn.L1Loss(reduction='sum')
    avg_loss, avg_loss2 = 0.0, 0.0

    with torch.no_grad():
        for batch, data in enumerate(testloader, 0):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = net(inputs).reshape(-1, 68, 2)

            loss = criterion(outputs, labels)
            avg_loss += loss.item()

            # Euclidean norm of the prediction error over the whole batch
            loss2 = np.linalg.norm((labels - outputs).cpu().numpy())
            avg_loss2 += loss2

            sys.stdout.write(
                '\rTest Epoch: {} Batch {} total_L1_Loss: {:.2f} '
                'avg_L1_Loss_per_img: {:.2f} total_norm_loss: {:.2f}'.format(
                    0, batch, avg_loss, avg_loss / (batch + 1) / batch_size,
                    avg_loss2))
            sys.stdout.flush()

    print()

    print()

What is wrong with my code?

PS: I normalise the images with the following code:

    img = cv2.normalize(img, None, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)

After 4000 epochs I get outputs like the image below, where the yellow dots are the ground truth and the blue ones are the predictions.

Upvotes: 3

Answers (1)

Shai

From your output image you can tell that the error is smaller on the top-left landmarks and grows larger towards the lower-right part of the face.
The landmarks you are trying to predict are (x, y) coordinates relative to the top-left corner of the image. As you can see, your model's prediction error grows roughly in proportion to the norm of each coordinate. This is not an uncommon phenomenon: when your model predicts a landmark close to the origin (e.g. the left eye) it makes "small" predictions, since the norm of this landmark is also small; the learned weights are small and therefore the errors are small as well. On the other hand, when predicting landmarks far from the origin (e.g. the right part of the mouth) the model needs to make "large" predictions, since the norm of these landmarks is large. Consequently, the trained weights are larger, resulting in larger errors.

To mitigate this, you should pre-process your data (train and test) and normalize the coordinates of the landmarks¹ so that they are:
1. relative to the center of the image
2. relative to the image size

That is, instead of (x, y) coordinates in the range [0, width]x[0, height] you should have the landmarks in the range [-1, 1]x[-1, 1].
After prediction, to get the original coordinates you simply need to shift them back and scale them by the image size.
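A minimal sketch of this normalization and its inverse, assuming the landmarks are tensors of shape (..., 68, 2) in pixel coordinates and width/height are the full image dimensions:

def normalize_landmarks(pts, width, height):
    # (x, y) pixel coordinates -> roughly [-1, 1] x [-1, 1],
    # relative to the image center and scaled by half the image size.
    center = pts.new_tensor([width / 2.0, height / 2.0])
    return (pts - center) / center

def denormalize_landmarks(pts, width, height):
    # Inverse mapping: [-1, 1] coordinates back to pixels.
    center = pts.new_tensor([width / 2.0, height / 2.0])
    return pts * center + center

With this, the labels would be normalized before training (e.g. inside the dataset) and the network outputs de-normalized back to pixels before measuring errors at validation time.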


¹ I am assuming here that all faces in the training set are roughly the same size and located roughly in the center of the images. If your setting is "in the wild", where faces can be of any size at any place in the image, I'm afraid this will not work.

Upvotes: 1
