Reputation: 803
I am new in machine learning. I did a test but do not know how to explain and evaluate.
Case 1:
I first divide randomly the data (data A, about 8000 words) into 10 groups (a1..a10). Within each group, I use 90% of data to build ngram model. This ngram model is then tested on the other 10% data of the same group. The result is below 10% accuracy. Other 9 groups are done same way (respectively build model and respectively tested on the remained 10% data of that group). All results are about 10% accuracy. (Is this 10 fold cross-validation?)
Case 2:
I first build a ngram model based on entire data set (data A) of about 8000 words. Then I divide this A into 10 groups(a1,a2,a3..a10), randomly of course. I then use this ngram to test respectively a1,a2..a10. I found the model is almost 96% accuracy on all groups.
How to explain such situations. Thanks in advance.
Upvotes: 0
Views: 164
Reputation: 77454
You need to read up on the topic of overfitting.
The situation you describes gives the impression that your ngram model is heavily overfitted: it can "memorize" 96% of the training data. But when trained on a proper subset, it only achieves a prediction on the unknown data of 10%.
Upvotes: 2
Reputation: 61044
Yes, 10-fold cross validation.
This testing method has the common flaw of testing on the training set. That is why the accuracy is inflated. It is unrealistic because, in real life, your test instances are novel and previously unseen by the system.
N-fold cross validation is a valid evaluation method used in many works.
Upvotes: 3