Waranthorn Chansawang

Reputation: 99

Is it okay to augment the data first, then randomly choose which data to keep, and split afterward?

I am doing a science project on classifying medical images, but I do not have much data. Is it okay to augment the data first, then randomly select which images to keep, and split the kept images afterward? My teacher originally told me to augment the data first and then split it into train, validation, and test sets. But I think that approach will make the training set overlap with the test set, which would make the accuracy unrealistically high. So I thought that randomly choosing files after augmentation would keep the augmented images from being too similar to each other and would also solve the problem of the imbalanced amount of data.

Upvotes: 0

Views: 345

Answers (1)

terraCoder

Reputation: 65

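We want the model to generalize well to data it has not seen, so data augmentation should be applied only to the training set. I would suggest that you split your dataset into training, validation, and test sets first, and then augment only the training set. If you augment before splitting, variants of the same original image can end up in both the training and test sets, so the test accuracy will be too optimistic, even if you randomly discard some of the augmented images.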
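To make the order of operations concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the function names `split_then_augment` and `hflip`, the 15% validation and test fractions, and the use of nested lists as stand-ins for images. In a real project you would likely use your framework's own splitting and augmentation tools.

```python
import random


def hflip(image):
    """Hypothetical augmentation: mirror each row of a 2D image (list of lists)."""
    return [row[::-1] for row in image]


def split_then_augment(images, labels, augment_fns,
                       val_frac=0.15, test_frac=0.15, seed=0):
    """Split the ORIGINAL images first, then augment only the training part.

    Returns (train, val, test), each a list of (image, label) pairs.
    The fractions and the seed are illustrative defaults, not recommendations.
    """
    indices = list(range(len(images)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle

    n_test = int(len(indices) * test_frac)
    n_val = int(len(indices) * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    train = [(images[i], labels[i]) for i in train_idx]
    val = [(images[i], labels[i]) for i in val_idx]
    test = [(images[i], labels[i]) for i in test_idx]

    # Augmented copies are derived only from training originals, so no
    # variant of a validation or test image can leak into training.
    augmented = [(fn(img), lab) for img, lab in train for fn in augment_fns]
    return train + augmented, val, test


# Toy data: 20 tiny 2x2 "images" with alternating labels.
images = [[[i, i + 100], [i + 200, i + 300]] for i in range(20)]
labels = [i % 2 for i in range(20)]
train, val, test = split_then_augment(images, labels, [hflip])
print(len(train), len(val), len(test))
```

The validation and test sets stay untouched originals, so their accuracy reflects how the model handles real images rather than near-copies of what it trained on.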

Upvotes: 0
