user1896653

Reputation: 3327

How to use stratify for single column

I am very new to this data stuff, so I am not sure how best to phrase my question. I will try to describe my issue as simply as possible, showing the relevant parts of my code.

print(data)

Output:

array([[0, 0, 0, ..., 255, 255, 255],
       [255, 255, 255, ..., 0, 0, 0],
       [255, 255, 255, ..., 255, 255, 255],
       ...,
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255]], dtype=object)

print(result)

Output:

['Arrowhead' 'Arrowhead' 'Arrowhead' ... 'Vessel' 'Vessel' 'Vessel']

Converting labels to numbers:

LE = LabelEncoder()
target = LE.fit_transform(result)

print(target) 

Output:

[ 0  0  0 ... 38 38 38]

Splitting:

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42, stratify=target)

I got the error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

To fix the error, I had to remove stratify, which could be fine for the moment:

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

To build a CNN, I had to do this:

lb = preprocessing.LabelBinarizer()

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.fit_transform(y_test)

print(y_train_categorical.shape)
print(y_test_categorical.shape)

Output:

(1945, 38)
(487, 34)

Here is the problem: I need the second dimension of both arrays to be the same (y_train_categorical.shape[1] and y_test_categorical.shape[1]), because my model ends in a 38-unit output layer:

model = Sequential()

model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(100,100,1)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(38, activation='softmax'))

which works fine for model.fit():

model.fit(X_train, y_train_categorical, 
          batch_size=32, epochs=5, verbose=1)

but when evaluating on the test set,

loss, accuracy = model.evaluate(X_test, y_test_categorical, verbose=0)
print('Loss: ', loss,'\nAcc: ', accuracy)

I am getting this error:

ValueError: Error when checking target: expected dense_2 to have shape (38,) but got array with shape (34,)

How can I make y_train_categorical.shape[1] and y_test_categorical.shape[1] the same, or is there an easier way to fix this last error (raised when evaluating the model on the test set)?

Upvotes: 0

Views: 403

Answers (2)

nishant

Reputation: 925

Solution for the error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

The error mentions that there is a class in your target variable which has only one occurrence. To explain that, let's consider the below example:

random_list = ['a','a','a','b','b','c','d','d','e','e','e']
LE = LabelEncoder()
target = LE.fit_transform(random_list)
print(target)

gives

array([0, 0, 0, 1, 1, 2, 3, 3, 4, 4, 4])

Now if I try to do a train_test_split, this will throw an error.

train_test_split(target, test_size=0.2, stratify=target)
#ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

This is because there is only one occurrence of 'c', so a stratified split (stratify=target) cannot represent that class in both train and test. For stratification to work, every class needs at least 2 occurrences.
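One way to find the classes that trigger this error is to count the members of each class before splitting. The following is a minimal sketch using only the standard library; the helper name rare_classes is illustrative and not part of the original code, and whether dropping such samples is acceptable depends on your data.

```python
from collections import Counter

def rare_classes(labels, min_count=2):
    """Return the sorted labels that occur fewer than min_count times."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < min_count)

random_list = ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'e', 'e', 'e']

too_rare = rare_classes(random_list)
print(too_rare)  # ['c']

# Keep only the samples whose class has enough members to be stratified.
kept = [label for label in random_list if label not in too_rare]
print(len(kept))  # 10
```

Collecting more samples for the rare classes, or merging them into a broader class, are alternatives to dropping them.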

Additional Error with above example

Even if I remove 'c' from the above list, the split still fails, this time with a different error.

random_list = ['a','a','a','b','b','d','d','e','e','e']
LE = LabelEncoder()
target = LE.fit_transform(random_list) # produces array([0, 0, 0, 1, 1, 2, 2, 3, 3, 3])
train_test_split(target, test_size=0.2, stratify=target)
#ValueError: The test_size = 2 should be greater or equal to the number of classes = 4

For stratify to work, every class has to occur in both train and test. If there are not enough data points to give each class at least one slot in each set, the above error is thrown. Here test_size=0.2 of 10 samples yields only 2 test samples, so at most 2 of the 4 classes could appear in the test set.
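For contrast, here is a small sketch of a split that does satisfy both conditions. The toy target (4 classes with 5 samples each) is made up for illustration; with test_size=0.2 the test set has 4 samples, exactly one per class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 4 classes with 5 samples each (20 samples in total).
target = np.repeat([0, 1, 2, 3], 5)

# test_size=0.2 gives 4 test samples, which is enough for one per class.
y_train, y_test = train_test_split(
    target, test_size=0.2, random_state=42, stratify=target
)

print(np.bincount(y_train))  # [4 4 4 4]
print(np.bincount(y_test))   # [1 1 1 1]
```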

Upvotes: 0

desertnaut

Reputation: 60328

In general, irrespective of the error and methodologically speaking, this:

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.fit_transform(y_test)

is wrong: we never fit our preprocessing stuff on the test set, we reuse the transformations as fitted in the train set, i.e.:

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.transform(y_test) # transform only

This may also resolve your error, if all the labels of your test set are present in your train set - which should be the case for a well-formed predictive ML problem (otherwise the problem itself is ill-defined).
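A small self-contained sketch of the difference, using made-up integer labels in which the test set happens to lack one class. Fitting on the train labels and only transforming the test labels keeps the number of columns identical:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical labels: class 2 never appears in the test set.
y_train = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y_test = np.array([0, 1, 3])

lb = LabelBinarizer()
y_train_categorical = lb.fit_transform(y_train)  # learns classes 0..3
y_test_categorical = lb.transform(y_test)        # reuses the same 4 columns

print(y_train_categorical.shape)  # (8, 4)
print(y_test_categorical.shape)   # (3, 4)

# Refitting on the test labels alone yields only 3 columns.
print(LabelBinarizer().fit_transform(y_test).shape)  # (3, 3)
```

This mirrors the (1945, 38) versus (487, 34) mismatch in the question: the second fit only saw the classes present in the test split.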

If lb.transform(y_test) then runs into labels that were not present (and encoded) when fitting on the train set, this means exactly that there are new, unseen labels in your test set, and this is the real issue you have to rectify here, and not some coding error.
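Because an encoder may not always complain loudly about unseen labels (as far as I know, LabelBinarizer.transform encodes them as all-zero rows rather than raising), it can help to check for them explicitly. A minimal sketch with a hypothetical helper, unseen_labels, and made-up label lists:

```python
def unseen_labels(train_labels, test_labels):
    """Return the sorted labels that occur in test but never in train."""
    return sorted(set(test_labels) - set(train_labels))

# Hypothetical labels: class 3 appears only in the test set.
y_train = [0, 0, 1, 2]
y_test = [1, 3]

missing = unseen_labels(y_train, y_test)
print(missing)  # [3]
```

An empty result means every test label was seen during fitting.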

Upvotes: 1
