Reputation: 71
I am trying to reproduce the training process of dlib's frontal_face_detector(). I am using the very same dataset (from http://dlib.net/files/data/dlib_face_detector_training_data.tar.gz) that dlib says was used, taking the union of the frontal and profile faces plus their mirrored reflections.
My problems are:

1. Very high memory usage for the whole dataset (30+ GB).
2. Training on a partial dataset does not yield a very high recall rate: 50-60 percent, compared to frontal_face_detector's 80-90 (testing on a subset of images not used for training).
3. The detectors work badly on low-resolution images and thus fail to detect faces that are more than 1-1.5 meters from the camera.
4. Training run time increases significantly with the SVM's C parameter, which I have to increase to achieve a better recall rate (I suspect this is just an overfitting artifact).
My original motivation for training was:

a. gaining the ability to adapt to the specific environment where the camera is installed, e.g. by hard negative mining;
b. improving detection in depth, and run time, by reducing the 80x80 window to 64x64 or even 48x48.
Am I on the right path? Am I missing anything? Please help...
Upvotes: 4
Views: 1392
Reputation: 4791
The training parameters are recorded in a comment in dlib's code, at http://dlib.net/dlib/image_processing/frontal_face_detector.h.html. For reference:
It is built out of 5 HOG filters. A front looking, left looking, right looking, front looking but rotated left, and finally a front looking but rotated right one.
Moreover, here is the training log and parameters used to generate the filters:
The front detector:
    trained on mirrored set of labeled_faces_in_the_wild/frontal_faces.xml
    upsampled each image by 2:1
    used pyramid_down<6>
    loss per missed target: 1
    epsilon: 0.05
    padding: 0
    detection window size: 80 80
    C: 700
    nuclear norm regularizer: 9
    cell_size: 8
    num filters: 78
    num images: 4748
    Train detector (precision,recall,AP): 0.999793 0.895517 0.895368
    singular value threshold: 0.15

The left detector:
    trained on labeled_faces_in_the_wild/left_faces.xml
    upsampled each image by 2:1
    used pyramid_down<6>
    loss per missed target: 2
    epsilon: 0.05
    padding: 0
    detection window size: 80 80
    C: 250
    nuclear norm regularizer: 8
    cell_size: 8
    num filters: 63
    num images: 493
    Train detector (precision,recall,AP): 0.991803 0.86019 0.859486
    singular value threshold: 0.15

The right detector:
    trained left-right flip of labeled_faces_in_the_wild/left_faces.xml
    upsampled each image by 2:1
    used pyramid_down<6>
    loss per missed target: 2
    epsilon: 0.05
    padding: 0
    detection window size: 80 80
    C: 250
    nuclear norm regularizer: 8
    cell_size: 8
    num filters: 66
    num images: 493
    Train detector (precision,recall,AP): 0.991781 0.85782 0.857341
    singular value threshold: 0.19

The front-rotate-left detector:
    trained on mirrored set of labeled_faces_in_the_wild/frontal_faces.xml
    upsampled each image by 2:1
    used pyramid_down<6>
    rotated left 27 degrees
    loss per missed target: 1
    epsilon: 0.05
    padding: 0
    detection window size: 80 80
    C: 700
    nuclear norm regularizer: 9
    cell_size: 8
    num images: 4748
    singular value threshold: 0.12

The front-rotate-right detector:
    trained on mirrored set of labeled_faces_in_the_wild/frontal_faces.xml
    upsampled each image by 2:1
    used pyramid_down<6>
    rotated right 27 degrees
    loss per missed target: 1
    epsilon: 0.05
    padding: 0
    detection window size: 80 80
    C: 700
    nuclear norm regularizer: 9
    cell_size: 8
    num filters: 89
    num images: 4748
    Train detector (precision,recall,AP): 1 0.897369 0.897369
    singular value threshold: 0.15
What the parameters are and how to set them is all explained in the dlib documentation. There is also a paper that describes the training algorithm: Max-Margin Object Detection.
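For concreteness, here is a minimal sketch of how those logged values map onto dlib's public training API, modeled on dlib's fhog_object_detector_ex.cpp example and using the front-detector settings. This is not the exact program used to build the shipped detector (that is not published); "training.xml", the thread count, and the output filename are placeholders.

    #include <iostream>
    #include <dlib/svm_threaded.h>
    #include <dlib/image_processing.h>
    #include <dlib/data_io.h>

    using namespace dlib;

    int main()
    {
        // Load an imglab-format dataset ("training.xml" is a placeholder).
        dlib::array<array2d<unsigned char>> images;
        std::vector<std::vector<rectangle>> boxes;
        load_image_dataset(images, boxes, "training.xml");

        // "upsampled each image by 2:1" and "trained on mirrored set of ..."
        upsample_image_dataset<pyramid_down<2>>(images, boxes);
        add_image_left_right_flips(images, boxes);

        // "used pyramid_down<6>", "detection window size: 80 80",
        // "padding: 0", "cell_size: 8", "nuclear norm regularizer: 9"
        typedef scan_fhog_pyramid<pyramid_down<6>> image_scanner_type;
        image_scanner_type scanner;
        scanner.set_detection_window_size(80, 80);
        scanner.set_padding(0);
        scanner.set_cell_size(8);
        scanner.set_nuclear_norm_regularization_strength(9);

        // "C: 700", "epsilon: 0.05", "loss per missed target: 1"
        structural_object_detection_trainer<image_scanner_type> trainer(scanner);
        trainer.set_c(700);
        trainer.set_epsilon(0.05);
        trainer.set_loss_per_missed_target(1);
        trainer.set_num_threads(4);  // placeholder; pick your core count
        trainer.be_verbose();

        object_detector<image_scanner_type> detector = trainer.train(images, boxes);

        // Presumably where the "num filters" and "singular value threshold"
        // lines in the log come from: the nuclear norm regularizer pushes the
        // learned HOG filter toward low rank, and thresholding its singular
        // values reduces the number of separable filters, speeding up detection.
        std::cout << "num filters: " << num_separable_filters(detector) << std::endl;
        detector = threshold_filter_singular_values(detector, 0.15);

        serialize("my_face_detector.svm") << detector;
        return 0;
    }

The same scaffolding should apply to the smaller windows you mention: change set_detection_window_size(64, 64) (or 48, 48) and leave the rest alone.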
Yes, it can take a lot of RAM to run the trainer.
Upvotes: 4