Aalok_G

Reputation: 501

Leave-one-out accuracy for multi-class classification

I am a bit confused about how to use the leave-one-out (LOO) method for calculating accuracy in the case of multi-class, one-vs-rest classification. I am working on the YUPENN Dynamic Scene Recognition dataset, which contains 14 categories with 30 videos in each category (a total of 420 videos). Let's name the 14 classes {A,B,C,D,E,F,G,H,I,J,K,L,M,N}.

I am using a linear SVM for one-vs-rest classification. Let's say I want to find the accuracy for class 'A'. When I perform 'A' vs. 'rest', I need to exclude one video while training and test the model on the video I excluded. Should this excluded video come only from class A, or should it come from all the classes?

In other words, to find the accuracy for class 'A', should I perform SVM with LOO 30 times (leaving out each video from class 'A' exactly once), or should I perform it 420 times (leaving out each video from every class exactly once)?

I have a feeling that I have got this all mixed up. Can anyone provide a short outline of the right way to perform multi-class classification using LOO?
Also, how do I perform this using libsvm in MATLAB?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The number of videos in the dataset is small, so I can't afford to create a separate TEST set (the one that was supposed to be sent to Neptune). Instead I have to make full use of the dataset, because each video provides some new/unique information. In scenarios like this, I have read that people use LOO as a measure of accuracy (when an isolated TEST set can't be afforded). They call it the Leave-One-Video-Out (LOVO) experiment.

The people who have worked on Dynamic Scene Recognition have used this methodology for reporting accuracy. In order to compare the accuracy of my method against theirs, I need to use the same evaluation process. But they have only mentioned that they use LOVO for accuracy; not much detail beyond that is provided. I am a newbie in this field, so it is a bit confusing.

As far as I can tell, LOVO can be done in two ways (a rough code sketch follows the two descriptions below):

1) Leave one video out of the 420 videos. Train 14 one-vs-rest classifiers ('A' vs. 'rest', 'B' vs. 'rest', ..., 'N' vs. 'rest') using the remaining 419 videos as the training set.

Evaluate the left-out video with the 14 classifiers and label it with the class that gives the maximum confidence score. Thus one video is classified. The same procedure is followed to label all 420 videos. Using these 420 labels we can compute the confusion matrix, the false positives/negatives, precision, recall, etc.

2) From each of the 14 classes I leave out one video, which means I choose 406 videos for training and 14 for testing. Using the 406 videos I train the 14 one-vs-rest classifiers, then evaluate each of the 14 videos in the test set and label them based on the maximum confidence score. In the next round I again leave out 14 videos, one from each class, chosen so that none of them were left out in a previous round. I again train and evaluate, and obtain labels. I carry on this process 30 times, with a non-repeating set of 14 videos each time, so that in the end all 420 videos are labelled. In this case as well, I calculate the confusion matrix, accuracy, precision, recall, etc.
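
To make the two variants concrete, here is a minimal sketch of how I imagine coding variant 1 with the libsvm MATLAB interface (svmtrain/svmpredict); the feature matrix X, the label vector y, and the '-c 1' cost are placeholders of mine, not something taken from the papers.

    % Variant 1 of LOVO: leave out one of the 420 videos at a time.
    % Assumes X is a 420-x-D double feature matrix (one row per video) and
    % y is a 420-x-1 label vector with values 1..14. All names are illustrative.
    N = size(X, 1);                 % 420 videos
    K = numel(unique(y));           % 14 classes
    pred = zeros(N, 1);

    for i = 1:N                                        % leave video i out
        trainIdx = setdiff(1:N, i);
        scores = zeros(1, K);
        for c = 1:K                                    % 14 one-vs-rest linear SVMs
            yBin  = 2 * double(y(trainIdx) == c) - 1;  % +1 for class c, -1 for rest
            model = svmtrain(yBin, X(trainIdx, :), '-t 0 -c 1 -q');
            [~, ~, dec] = svmpredict(0, X(i, :), model);
            scores(c)   = dec * model.Label(1);        % orient so positive = class c
        end
        [~, pred(i)] = max(scores);                    % most confident classifier wins
    end

    confMat = accumarray([y pred], 1, [K K]);          % 14-x-14 confusion matrix
    acc     = mean(pred == y);

    % Variant 2 would only change the outer loop: run 30 rounds that each hold
    % out one video per class (14 at a time) and train the same 14 one-vs-rest
    % SVMs on the remaining 406 videos.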

Apart from these two methods, LOVO could be done in many other styles. The papers on Dynamic Scene Recognition do not mention exactly how they perform LOVO. Is it safe to assume that they are using the first method? Is there any way of deciding which method would be better? Would there be a significant difference in the accuracies obtained by the two methods?



Following are some recent papers on Dynamic Scene Recognition, for reference; their evaluation sections mention LOVO:

1) http://www.cse.yorku.ca/vision/publications/FeichtenhoferPinzWildesCVPR2014.pdf
2) http://www.cse.yorku.ca/~wildes/wildesBMVC2013b.pdf
3) http://www.seas.upenn.edu/~derpanis/derpanis_lecce_daniilidis_wildes_CVPR_2012.pdf
4) http://webia.lip6.fr/~thomen/papers/Theriault_CVPR_2013.pdf
5) http://www.umiacs.umd.edu/~nshroff/DynScene.pdf

Upvotes: 3

Views: 2780

Answers (2)

khan

Reputation: 531

In short, the papers you mentioned use the second criterion, i.e. leaving out one video from each class, which makes 14 videos for testing and the rest for training.

Upvotes: 0

ely

Reputation: 77484

When using cross-validation, it is good to keep in mind that it applies to training a model, and not usually to the honest-to-god, end-of-the-whole-thing measure of accuracy, which is instead reserved for classification accuracy on a testing set that has not been touched at all or involved in any way during training.

Let's focus on just one classifier that you plan to build: the "A vs. rest" classifier. You are going to separate all of the data into a training set and a testing set, and then you are going to put the testing set in a cardboard box, staple it shut, cover it with duct tape, place it in a titanium vault, and attach it to a NASA rocket that will deposit it in the ice-covered oceans of Neptune.

Then let's look at the training set. When we train with it, we'd like to leave some of the training data to the side, just for calibration, but not as part of the official Neptune-ocean test set.

So what we can do is tell every data point (in your case, it appears that a data point is a video-valued object) to sit out once. We don't care whether it comes from class A or not. So if there are 420 videos that would be used in the training set for just the "A vs. rest" classifier, then yeah, you're going to fit 420 different SVMs.

And in fact, if you are tweaking parameters for the SVM, this is where you'll do it. For example, if you're trying to choose a penalty term or a coefficient in a polynomial kernel or something, then you will repeat the entire training process (yep, all 420 different trained SVMs) for all of the combinations of parameters you want to search through. And for each collection of parameters, you will associate with it the sum of the accuracy scores from the 420 LOO trained classifiers.

Once that's all done, you choose the parameter set with the best LOO score, and voila, that is your "A vs. rest" classifier. Rinse and repeat for "B vs. rest" and so on.
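
Since you mentioned libsvm in MATLAB, here is a rough sketch of what that LOO parameter sweep could look like for the single "A vs. rest" classifier; Xtrain, the +1/-1 label vector yA, and the grid of cost values are assumptions for illustration, not something prescribed by libsvm or by the papers.

    % Hypothetical LOO-based selection of the SVM cost C for one "A vs. rest"
    % classifier with the libsvm MATLAB interface (svmtrain/svmpredict).
    Cgrid  = 10.^(-2:2);            % example grid of cost values
    N      = size(Xtrain, 1);
    looAcc = zeros(size(Cgrid));

    for k = 1:numel(Cgrid)
        opts    = sprintf('-t 0 -c %g -q', Cgrid(k));
        correct = 0;
        for i = 1:N                                  % each training video sits out once
            idx     = setdiff(1:N, i);
            model   = svmtrain(yA(idx), Xtrain(idx, :), opts);
            p       = svmpredict(yA(i), Xtrain(i, :), model);
            correct = correct + (p == yA(i));
        end
        looAcc(k) = correct / N;                     % LOO score for this C
    end

    [~, best]  = max(looAcc);
    finalModel = svmtrain(yA, Xtrain, sprintf('-t 0 -c %g -q', Cgrid(best)));
    % finalModel is what finally gets evaluated, once, on the untouched test set.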

With all of this going on, there is rightfully a big worry that you are overfitting the data, especially if many of the "negative" samples have to be repeated from class to class.

But, this is why you sent that testing set to Neptune. Once you finish with all of the LOO-based parameter-swept SVMs and you've got the final classifier in place, now you execute that classifier across your actual test set (from Neptune), and that will tell you whether the entire thing is showing efficacy in predicting on unseen data.

This whole exercise is obviously computationally expensive. So instead people will sometimes use Leave-P-Out, where P is much larger than 1. And instead of repeating that process until all of the samples have spent some time in a left-out group, they will just repeat it a "reasonable" number of times, for various definitions of reasonable.

In the Leave-P-Out situation, there are some algorithms which do allow you to sample which points are left out in a way that represents the classes fairly. So if the "A" samples make up 40% of the data, you might want them to take up about 40% of the leave-out set.
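
As a rough illustration of that class-proportional idea (not any particular published algorithm), here is how one leave-out draw could be sampled in plain MATLAB; the 10% fraction and the variable names are just examples.

    % Stratified draw of a leave-out set: take roughly the same fraction of
    % samples from every class so the left-out group mirrors class frequencies.
    % y is an N-x-1 label vector; the 10% fraction is an example.
    frac    = 0.10;
    classes = unique(y);
    heldOut = [];
    for c = classes(:)'                         % loop over class labels
        idx     = find(y == c);
        idx     = idx(randperm(numel(idx)));    % shuffle within the class
        nOut    = max(1, round(frac * numel(idx)));
        heldOut = [heldOut; idx(1:nOut)];       %#ok<AGROW>
    end
    trainIdx = setdiff((1:numel(y))', heldOut); % the rest stays in training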

This doesn't really apply for LOO, for two reasons: (1) you're almost always going to perform LOO on every training data point, so trying to sample them in a fancy way would be irrelevant if they are all going to end up being used exactly once. (2) If you plan to use LOO for some number of times that is smaller than the sample size (not usually recommended), then just drawing points randomly from the set will naturally reflect the relative frequencies of the classes, and so if you planned to do LOO for K times, then simply taking a random size-K subsample of the training set, and doing regular LOO on those, would suffice.

Upvotes: 3
