Reputation: 1217
I am facing the following problem. I have a system that produces a ranking of operations according to their anomaly score. To improve its performance, I implemented a genetic algorithm to perform feature selection, so that the most anomalous operations appear in the first positions. What I am doing is not exactly feature selection, because I am not using binary variables but rather float variables between 0 and 1 whose sum equals 1.
Currently, I have a population of 200 individuals evolved for 50 generations. The evaluation function is the system itself: I measure the quality of a solution with the true positive rate, counting how many anomalous operations appear in the first N positions (where N is the number of anomalous operations). As operators I use uniform crossover, and for mutation I change the value of one cell of the individual. Of course, each time I repair the individual so that its weights sum to 1. Finally, I use elitism to keep the best-so-far solution over time.
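To make the setup concrete, here is a minimal sketch of the representation, the uniform crossover, the mutation, and the repair step I described (numpy-based; all names are just illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def repair(individual):
    """Clip weights to [0, 1] and renormalize them so they sum to 1."""
    individual = np.clip(individual, 0.0, 1.0)
    total = individual.sum()
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        return np.full_like(individual, 1.0 / individual.size)
    return individual / total

def uniform_crossover(parent_a, parent_b):
    """For each gene, copy it from one of the two parents at random."""
    mask = rng.random(parent_a.size) < 0.5
    return repair(np.where(mask, parent_a, parent_b))

def mutate(individual, sigma=0.1):
    """Perturb one randomly chosen weight, then repair the individual."""
    child = individual.copy()
    i = rng.integers(child.size)
    child[i] += rng.normal(0.0, sigma)
    return repair(child)
```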
I observed that one feature gets a very high weight. That feature is often important, but not always, and its high weight forces very low weights onto the other features. I suspect that my GA is overfitting. Can you help me find a good stopping criterion?
Upvotes: 8
Views: 5952
Reputation: 8401
Overfitting in genetic algorithms and genetic programming is a big issue that is currently a research focus of the GP community, myself included. Most of this research is aimed at genetic programming and the evolution of classification/regression models, but it might also relate to your problem. There are some papers which might help you (and which I am working with too):
You can find the papers (the first two directly in pdf) by searching for their titles in scholar.google.com.
Basically, what all the papers work with is the idea of using only a subset of the training data to direct the evolution, and (randomly) changing this subset every generation (using the same subset for all individuals within a generation). Interestingly, experiments show that the smaller this subset is, the less overfitting occurs, up to the extreme of using a single-element subset. The papers build on this idea and extend it with some tweaks (like switching between the full dataset and a subset). But as I said at the beginning, all of this is aimed at symbolic regression (more or less) and not at feature selection.
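A minimal sketch of that per-generation subsampling, assuming a Python loop where `fitness` and `next_generation` are placeholder callables you supply (they are illustrative, not from the papers):

```python
import random

def evolve(population, training_data, fitness, next_generation,
           generations=50, subset_size=32):
    """Evolve `population`, scoring each generation on a fresh random
    subset of the training data (the same subset for every individual)."""
    for _ in range(generations):
        # Resample the subset once per generation, not once per individual.
        subset = random.sample(training_data, min(subset_size, len(training_data)))
        scores = [fitness(ind, subset) for ind in population]
        population = next_generation(population, scores)
    return population
```

Shrinking `subset_size` is the knob the papers experiment with; the smaller it is, the noisier each generation's fitness signal, which is what appears to curb overfitting.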
I personally once tried another approach (again for symbolic regression by genetic programming): a subset of the training data (e.g. half of it) was used to drive the evolution (i.e. for fitness), but the "best-so-far" solution was determined using results on the remaining training data. The overfitting was much less significant.
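A rough sketch of that split, again with illustrative `fitness` and `next_generation` callables:

```python
import random

def split_driven_search(population, training_data, fitness, next_generation,
                        generations=50, seed=0):
    """Drive selection with one half of the data, but pick the reported
    best-so-far individual using only the held-out half."""
    rng = random.Random(seed)
    shuffled = training_data[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    drive_set, holdout_set = shuffled[:half], shuffled[half:]

    best, best_holdout_score = None, float("-inf")
    for _ in range(generations):
        drive_scores = [fitness(ind, drive_set) for ind in population]
        # The best-so-far is judged on data the evolution never uses for selection.
        for ind in population:
            score = fitness(ind, holdout_set)
            if score > best_holdout_score:
                best, best_holdout_score = ind, score
        population = next_generation(population, drive_scores)
    return best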
Upvotes: 19