Reputation: 1226
In order to learn more about machine-learning algorithms, I am playing with some data I collected myself, but I am seeing strange behaviour when I use it with my neural network.
My dataset is split into 3 possible categories (say A is 5% of the dataset, B is 5% and C is 90%).
When I train on a "small" training set (~1800 entries), my training-set accuracy is close to 100% (A: 99%, B: 100%, C: 100% -> as expected), but my cross-validation-set and test-set accuracies are very bad.
So I tried with a larger training set (~12000 entries): my training-set accuracy drops drastically (A: 18%, B: 28%, C: 99%) and the test-set accuracy is still bad.
Then I tried with a medium training set (~5500 entries), and as expected the training-set accuracy falls between the two previous results (A: 45%, B: 78%, C: 99%), while the test-set accuracy remains bad.
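The per-class figures here are recall-style accuracies, i.e. the fraction of each class's examples that are predicted correctly. A minimal sketch of computing them, assuming scikit-learn (not specified in the original post):

```python
# Hypothetical sketch: per-class accuracy as the diagonal of the
# confusion matrix divided by each row's total (per-class recall).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array(["A", "B", "C", "C", "A", "C", "B", "C"])
y_pred = np.array(["A", "C", "C", "C", "C", "C", "B", "C"])

cm = confusion_matrix(y_true, y_pred, labels=["A", "B", "C"])
per_class = cm.diagonal() / cm.sum(axis=1)
for label, acc in zip(["A", "B", "C"], per_class):
    print(f"{label}: {acc:.0%}")
```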
Do you know what may cause such results? Is my dataset missing qualitative features that could help differentiate the 3 categories A, B and C, or is there another underlying reason?
Here is the configuration of my current neural network, in case it gives some hints:
Upvotes: 0
Views: 634
Reputation: 741
You are overfitting to class C because the three classes in the training set are heavily imbalanced (5%, 5% and 90%). This explains, first of all, the low cross-validation and test-set accuracy. Then, as the training-set size increases, the training-set accuracy on A and B also drops: with so many C examples, they dominate the weight updates, even if you use a small learning rate.

In other words, the weight updates produced by training on classes A and B are largely "forgotten" by the network, because the updates produced by training on class C are much more significant.
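If class imbalance is indeed the problem, rebalancing the training data (or weighting the loss per class) usually helps. Below is a minimal sketch, assuming NumPy and scikit-learn (neither is mentioned in the question), that oversamples the minority classes until every class matches the majority size:

```python
# Hypothetical sketch: rebalance classes A/B/C by oversampling the
# minority classes before training.
import numpy as np
from sklearn.utils import resample

def oversample_minority(X, y, random_state=0):
    """Resample every class up to the size of the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls in classes:
        mask = (y == cls)
        X_cls, y_cls = resample(
            X[mask], y[mask],
            replace=True,       # sample with replacement
            n_samples=target,   # grow each class to the majority size
            random_state=random_state,
        )
        X_parts.append(X_cls)
        y_parts.append(y_cls)
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Example with a 5% / 5% / 90% split similar to the question:
X = np.random.randn(1000, 4)
y = np.r_[np.zeros(50), np.ones(50), np.full(900, 2)]
X_bal, y_bal = oversample_minority(X, y)
print(np.unique(y_bal, return_counts=True))  # 900 samples per class
```

An alternative with the same intent is to leave the data as-is and give classes A and B a proportionally larger weight in the loss function; most frameworks expose this as a per-class weight parameter.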
Upvotes: 1