Reputation: 949
I am going to use ANN for my work in which I have a large dataset, let say input[600x40] and output[600x6]. As one can see, the number of inputs (40) is too high for ANN and it may trap in local minimum and/or increases the CPU time dramatically. Is there any way to select the most informative input?
As my first try, I used the following code in Matlab to find the cross-correlation between each two inputs:
[rho, ~] = corr(inputs, 'rows','pairwise')
However, I think this simple correlation cannot identify some hidden complex relation between the inputs.
Any ideas?
Upvotes: 0
Views: 405
Reputation: 3194
Disclaimer: I'm with lejlot on this - you should get more data and more features instead of trying to remove features. Still, that doesn't answer your question, so here we go.
Try most basic greedy approach - try removing each feature and retrain your ANN (several times, of course) and see if your results got better or worse. Choose this situation where results got better and improvement was the best. Repeat until you'll get no improvement by removing features. This will take a lot of time, so you may want to try doing it on some subset of your data (for example on 3 folds of dataset splitted into 10 folds).
It's ugly, but sometimes it works.
I repeat what I've said in disclaimer - this is not the way to go.
Upvotes: 0
Reputation: 66795
First of all 40 inputs is a very small space and it should not be reduced. Large number of inputs is 100,000, not 40. Also, 600x40 is not a big dataset, nor the one "increasing the CPU time dramaticaly", if it learns slowly than check your code because it appears to be the problem, not your data.
Furthermore, feature selection is not a good way to go, you should use it only when gathering features is actually expensive. In any other scenario you are looking for dimensionality reduction, such as PCA, LDA etc. although as said before - your data should not be reduced, rather - you should consider getting more of it (new samples/new features).
Upvotes: 1