Reputation: 1226
In order to learn more about machine-learning algorithms, I am playing with some data I collected myself, but I am seeing strange behaviour when I use it with my neural network.
My dataset is split into 3 possible categories (say A is 5% of the dataset, B is 5% and C is 90%).
When I train on a "small" training set (~1800 entries), my training-set accuracy is close to 100% (A: 99%, B: 100%, C: 100% -> as expected), but my cross-validation-set and test-set accuracies are very bad.
So I tried with a larger training set (~12000 entries): my training-set accuracy drops drastically (A: 18%, B: 28%, C: 99%) and the test-set accuracy is still bad.
Then I tried with a medium training set (~5500 entries), and as expected the training-set accuracy falls between the two previous results (A: 45%, B: 78%, C: 99%), while the test-set accuracy remains bad.
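The per-class figures here are recall-style accuracies, i.e. the fraction of each class's examples that are predicted correctly. A minimal sketch of computing them, assuming scikit-learn (not specified in the original post):

```python
# Hypothetical sketch: per-class accuracy as the diagonal of the
# confusion matrix divided by each row's total (per-class recall).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array(["A", "B", "C", "C", "A", "C", "B", "C"])
y_pred = np.array(["A", "C", "C", "C", "C", "C", "B", "C"])

cm = confusion_matrix(y_true, y_pred, labels=["A", "B", "C"])
per_class = cm.diagonal() / cm.sum(axis=1)
for label, acc in zip(["A", "B", "C"], per_class):
    print(f"{label}: {acc:.0%}")
```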
Do you know what may cause such results? Is my dataset missing qualitative features that could help differentiate the 3 categories A, B and C, or is there another underlying reason?
Here is the configuration of my current neural network, in case it gives some hints:
Upvotes: 0
Views: 634
Reputation: 741
You are overfitting to class C because the three classes in the training set are heavily imbalanced (5%, 5% and 90%). This explains, first of all, the low cross-validation and test-set accuracy. Then, as the training-set size increases, the training-set accuracy on A and B also drops: with so many C examples, they dominate the weight updates, even if you use a small learning rate.

In other words, the weight updates produced by training on classes A and B are largely "forgotten" by the network, because the updates produced by training on class C are much more significant.
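If class imbalance is indeed the problem, rebalancing the training data (or weighting the loss per class) usually helps. Below is a minimal sketch, assuming NumPy and scikit-learn (neither is mentioned in the question), that oversamples the minority classes until every class matches the majority size:

```python
# Hypothetical sketch: rebalance classes A/B/C by oversampling the
# minority classes before training.
import numpy as np
from sklearn.utils import resample

def oversample_minority(X, y, random_state=0):
    """Resample every class up to the size of the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls in classes:
        mask = (y == cls)
        X_cls, y_cls = resample(
            X[mask], y[mask],
            replace=True,       # sample with replacement
            n_samples=target,   # grow each class to the majority size
            random_state=random_state,
        )
        X_parts.append(X_cls)
        y_parts.append(y_cls)
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Example with a 5% / 5% / 90% split similar to the question:
X = np.random.randn(1000, 4)
y = np.r_[np.zeros(50), np.ones(50), np.full(900, 2)]
X_bal, y_bal = oversample_minority(X, y)
print(np.unique(y_bal, return_counts=True))  # 900 samples per class
```

An alternative with the same intent is to leave the data as-is and give classes A and B a proportionally larger weight in the loss function; most frameworks expose this as a per-class weight parameter.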
Upvotes: 1