Anu

Reputation: 3430

How to build a face recognition system from scratch?

I am building a prototype for a face recognition system and while writing the algorithm, I had a few questions.

Algorithm:

  1. Collect triplets (A(i), P(i), N(i)): sets of anchor, positive, and negative images of employees working at XYX company.

  2. Using gradient descent, minimize the triplet loss function to learn the CNN parameters. In effect, I am training a Siamese network (the idea of running two identical CNNs on 2 different inputs [once on A(i)-P(i) and once on A(i)-N(i)] and then comparing the outputs); see the training sketch after this list.

    These learned parameters should ensure that the distance between the flattened n-dimensional encodings of two images of the same person is small, while the distance between encodings of two different people is large.

  3. Now, create a database in which you store the encoding of each training image of XYX company's employees.

    Simply make a forward pass of each image through the trained CNN and store the corresponding encoding in the database.

  4. At test time, you have either an image of an XYX company employee or an image of an outsider.

    • You will pass both test images through the CNN and get the corresponding encodings.

    • Now the question arises: how would you find the similarity between the test-picture encoding and all the training-picture encodings in the database?

      • First question: would you use cosine similarity, or do I need to do something else? Can you add more clarity on this?

      • Second question: also, in terms of efficiency, how would you handle a scenario in which the database contains the training-picture encodings of 100,000 employees, and for every new person you need to scan all 100,000 encodings, compute the similarity, and return a result in under 2 seconds? Any suggestion on this part?

    • Third question: usually, for a face recognition task, if we use the approach (Image --> CNN --> Softmax --> output), then each time a new person joins your organization you need to retrain your network, which is why it is a bad approach.
    • This problem can be mitigated by using the second approach, in which we use a learned distance function d(img1, img2) over pairs of employee images, as stated above in points 1 to 3.

      • My question is: when a new employee joins the organization, how will this learned distance function generalize, given that the new employee was not in the training set at all? Isn't this a problem of differing data distributions between the train and test sets? Any suggestions in this regard?
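For concreteness, here is a minimal sketch of step 2 in PyTorch. The architecture is a toy placeholder (EmbeddingNet, the 128-dimensional output, and the optimizer settings are my own illustrative choices; the 0.2 margin is the value used in the FaceNet paper):

    import torch
    import torch.nn as nn

    # Toy stand-in for the shared CNN; any architecture that maps an
    # image to a fixed-size encoding would do here.
    class EmbeddingNet(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.fc = nn.Linear(64, dim)

        def forward(self, x):
            h = self.features(x).flatten(1)
            # L2-normalize so encodings lie on the unit hypersphere.
            return nn.functional.normalize(self.fc(h), p=2, dim=1)

    net = EmbeddingNet()
    loss_fn = nn.TripletMarginLoss(margin=0.2, p=2)
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

    def train_step(anchor, positive, negative):
        # One gradient-descent step on a batch of (A, P, N) image triplets.
        optimizer.zero_grad()
        loss = loss_fn(net(anchor), net(positive), net(negative))
        loss.backward()
        optimizer.step()
        return loss.item()

Because the same net is applied to the anchor, the positive, and the negative, the weights are shared across all three forward passes, which is exactly the Siamese idea.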

Could anyone help me understand these conceptual glitches?

Upvotes: 2

Views: 1355

Answers (1)

Anu

Reputation: 3430

After doing a literature survey of face verification and recognition/detection research papers in computer vision, I think I have an answer to all of my questions, so I am answering them here.

First question: would you use cosine similarity? Can you add more clarity on this?

  • Find the minimum distance between the test encoding and every saved training-image encoding by simply computing the Euclidean distance between them.

  • Now keep a threshold, say 0.7; if the minimum distance is < 0.7, return the name of the employee, else return a "not in the database" error, as sketched below.
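A minimal sketch of this lookup in NumPy (db_encodings, db_names, and the 0.7 threshold are illustrative assumptions; in practice the threshold would be tuned on a validation set):

    import numpy as np

    def identify(test_encoding, db_encodings, db_names, threshold=0.7):
        # db_encodings: (N, 128) array of stored employee encodings;
        # db_names: list of the N corresponding employee names.
        dists = np.linalg.norm(db_encodings - test_encoding, axis=1)
        best = int(np.argmin(dists))
        if dists[best] < threshold:
            return db_names[best]
        return None  # "not in the database"

Note that if the encodings are L2-normalized, cosine similarity and Euclidean distance produce the same ranking (||a - b||^2 = 2 - 2 a.b), so either choice works.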

Second question: also, in terms of efficiency, how would you handle a scenario in which the database contains the training-picture encodings of 100,000 employees, and for every new person you need to scan all 100,000 encodings, compute the similarity, and return a result in under 2 seconds?

  • It should be noted that during training a 128-dimensional float vector is used, but it can be quantized to 128 bytes without loss of accuracy. Thus each face is compactly represented by a 128-dimensional byte vector, which is ideal for large-scale clustering and recognition. Smaller embeddings are possible at a minor loss of accuracy and could be employed on mobile devices (this point is from the FaceNet paper).
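To put the 2-second budget in perspective, a fully vectorized brute-force scan of 100,000 encodings already takes only milliseconds; the sketch below uses random data standing in for real encodings:

    import numpy as np

    rng = np.random.default_rng(0)

    # 100,000 hypothetical L2-normalized 128-d float encodings.
    db = rng.normal(size=(100_000, 128)).astype(np.float32)
    db /= np.linalg.norm(db, axis=1, keepdims=True)

    # A query near entry 42; one vectorized pass over the whole database.
    query = db[42] + 0.01 * rng.normal(size=128).astype(np.float32)
    dists = np.linalg.norm(db - query, axis=1)
    print(int(np.argmin(dists)))  # -> 42

    # Quantizing floats to bytes cuts memory 4x; the linear scale and
    # offset here are an illustrative choice, not the paper's scheme.
    db_q = np.clip((db + 1.0) * 127.5, 0, 255).astype(np.uint8)

For databases orders of magnitude larger, approximate nearest-neighbor libraries such as FAISS or Annoy keep lookups fast.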

Third question: first of all, we are learning the network parameters of the deep CNN (the Siamese network) by minimizing the triplet loss function.

  • Second, it is assumed that you have trained these model weights on a huge dataset covering millions of people, so that the weights have learned both high-level features, such as the identity or sex of a person, and low-level features, such as edges relevant to human faces.
  • Now, there is an assumption that these model parameters together can represent any human face, so you go ahead and save the new person's encoding in the database by making a forward pass through your network; later, use answer 1 to decide whether the person belongs to the organization or not (the face recognition problem). Moreover, the FaceNet paper mentions keeping a holdout set of around one million images that has the same distribution as the training set but disjoint identities, precisely to check that the embedding generalizes to unseen people.
  • Third, the way these two techniques differ is in how the model weights are trained: the first technique uses a softmax cross-entropy loss, while the second uses the triplet loss function.
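To make the contrast concrete, here is the triplet loss written out, a sketch following the FaceNet formulation (margin=0.2 is the paper's value). The structural difference is that a softmax classifier has one output unit per identity, so adding an employee changes the network's shape and forces retraining, whereas an encoder trained this way has no identity-specific parameters at all:

    import torch

    def triplet_loss(a, p, n, margin=0.2):
        # a, p, n: batches of anchor/positive/negative encodings, (B, 128).
        # L = mean(max(||a - p||^2 - ||a - n||^2 + margin, 0))
        d_ap = (a - p).pow(2).sum(dim=1)  # squared distance to positive
        d_an = (a - n).pow(2).sum(dim=1)  # squared distance to negative
        return torch.clamp(d_ap - d_an + margin, min=0).mean()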

Upvotes: 2
