JWLaser

Reputation: 11

Dataset splitting in semantic segmentation

I'm working on a biomedical image segmentation task. The data I got from the hospital has already been split into training, validation, and test sets, but I'm confused about how the split was done.

The data consists of images from different patients; each patient contributes 2-3 2D slices taken from a 3D volume. A patient's 2-3 slices are adjacent or close to each other, so they differ only slightly, barely visibly to the naked eye. The hospital distributed these 2-3 slices across the training, validation, and test sets, so the sizes of the three sets are close to 1:1:1.

However, most splits of medical image datasets that I have found are patient-based: the three sets contain slices from different patients, rather than distributing the slices of the same patient across all three sets as the hospital did. I'll give an example.

Example

Let i_j be the j-th slice of the i-th patient, so that i_j and i_(j+1) are adjacent slices. All slices are:

1_1 1_2 1_3 / 2_1 2_2 2_3 / 3_1 3_2 3_3 / 4_1 4_2 / 5_1 5_2

A possible splitting in the hospital's way:

Train: 1_1 2_1 3_1 4_1 5_1
Val: 1_2 2_2 3_2 4_2
Test: 1_3 2_3 3_3 5_2

A possible splitting in my way:

Train: 1_1 1_2 1_3 2_1 2_2 2_3 4_1 4_2
Val: 3_1 3_2 3_3
Test: 5_1 5_2

I think the hospital's way makes the training, validation, and test sets far too similar: near-duplicate slices leak across the sets. This will inflate validation and test accuracy while the model's generalization to new patients is actually worse. So which splitting method is correct? Or are both OK?
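For reference, a patient-level split like the one in my example can be sketched with scikit-learn's `GroupShuffleSplit`, which keeps all samples with the same group label (here, patient ID) in the same set. This is a minimal sketch; the slice names mirror the example above and the split ratios are illustrative.

```python
from sklearn.model_selection import GroupShuffleSplit

# Slices named i_j as in the example: j-th slice of the i-th patient.
slices = ["1_1", "1_2", "1_3", "2_1", "2_2", "2_3",
          "3_1", "3_2", "3_3", "4_1", "4_2", "5_1", "5_2"]
patients = [s.split("_")[0] for s in slices]  # group label = patient ID

# Step 1: carve off a test set by patient (1 of 5 patients here).
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_val_idx, test_idx = next(gss.split(slices, groups=patients))

train_val = [slices[i] for i in train_val_idx]
tv_patients = [patients[i] for i in train_val_idx]

# Step 2: split the remaining patients into train and validation.
gss2 = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(gss2.split(train_val, groups=tv_patients))

train = [train_val[i] for i in train_idx]
val = [train_val[i] for i in val_idx]
test = [slices[i] for i in test_idx]
```

Because the split is made on patient IDs, no patient's slices can end up in more than one set, which removes the near-duplicate leakage.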

Upvotes: 1

Views: 1083

Answers (1)

jss367

Reputation: 5381

Your way is definitely the right way to go. The hospital's method leaks near-identical slices from the same patient across the sets, which will produce overly optimistic metrics and mask overfitting, for exactly the reasons you describe.

Upvotes: 1
