How to use stratify for single column

Question

I am very new to this data stuff, so I am not sure how best to phrase my question. I will try to describe my issue as simply as possible, showing the relevant parts of my code.

print(data)

Output:

array([[0, 0, 0, ..., 255, 255, 255],
       [255, 255, 255, ..., 0, 0, 0],
       [255, 255, 255, ..., 255, 255, 255],
       ...,
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255]], dtype=object)

print(result)

Output:

['Arrowhead' 'Arrowhead' 'Arrowhead' ... 'Vessel' 'Vessel' 'Vessel']

Converting the labels to numbers:

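from sklearn.preprocessing import LabelEncoder

# Map each string label (e.g. 'Arrowhead') to an integer class id
LE = LabelEncoder()
target = LE.fit_transform(result)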

print(target)

Output:

[ 0  0  0 ... 38 38 38]

Splitting:

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42, stratify=target)

I got the error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

To fix the error, I had to remove stratify, which could be fine for the moment:

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

To build a CNN, I had to do this:

lb = preprocessing.LabelBinarizer()

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.fit_transform(y_test)

print(y_train_categorical.shape)
print(y_test_categorical.shape)

Output:

(1945, 38)
(487, 34)

Here is the problem. I need the same number of columns in both arrays (y_train_categorical.shape[1] and y_test_categorical.shape[1]), because my model is defined as:

model = Sequential()

model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(100,100,1)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(38, activation='softmax'))

which works fine for model.fit():

model.fit(X_train, y_train_categorical, 
          batch_size=32, epochs=5, verbose=1)

but when evaluating on the test set,

loss, accuracy = model.evaluate(X_test, y_test_categorical, verbose=0)
print('Loss: ', loss, '\nAcc: ', accuracy)

I am getting this error:

ValueError: Error when checking target: expected dense_2 to have shape (38,) but got array with shape (34,)

How can I make y_train_categorical.shape[1] and y_test_categorical.shape[1] the same, or is there an easy way to fix my last error (raised when evaluating the model on the test set)?

desertnaut · Accepted Answer

In general, irrespective of the error and methodologically speaking, this:

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.fit_transform(y_test)

is wrong: we never fit preprocessing steps on the test set; we reuse the transformations as fitted on the training set, i.e.:

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.transform(y_test) # transform only
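
To see why this matters for the shapes, here is a minimal sketch on made-up integer labels (the toy arrays stand in for your y_train and y_test and are assumptions, not your data). Fitting once on the training labels gives both encoded arrays the same number of columns, whereas refitting on the test labels yields one column per class that happens to occur there:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy labels (hypothetical): class 1 never appears in the test split.
y_train = np.array([0, 0, 1, 2, 2, 3])
y_test = np.array([0, 2, 3])

lb = LabelBinarizer()
y_train_categorical = lb.fit_transform(y_train)  # learns classes [0, 1, 2, 3]
y_test_categorical = lb.transform(y_test)        # reuses the same 4 classes

print(y_train_categorical.shape)  # (6, 4)
print(y_test_categorical.shape)   # (3, 4)

# Refitting on the test labels only sees 3 classes, hence the mismatch.
refit_shape = LabelBinarizer().fit_transform(y_test).shape
print(refit_shape)                # (3, 3)
```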

This may also resolve your error, provided all the labels in your test set are present in your training set, which should be the case for a well-formed predictive ML problem (otherwise the problem itself is ill-defined).
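
A quick way to check that condition is to compare the two label arrays directly. This is only a sketch on toy arrays (assumed for illustration); with your data you would pass your own y_test and y_train:

```python
import numpy as np

# Toy labels (hypothetical): label 3 never appears in training.
y_train = np.array([0, 0, 1, 2, 2])
y_test = np.array([0, 2, 3])

# Sorted unique labels that occur in y_test but not in y_train.
unseen = np.setdiff1d(y_test, y_train)
print(unseen)  # [3]
```

An empty result means every test label was seen during training.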

If lb.transform(y_test) then has trouble with labels it has not previously seen (and encoded), for instance by complaining about them or by encoding them as all-zero rows, this means exactly that there are new, unseen labels in your test set. That is the real issue you have to rectify here, not some coding error.
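
One possible way to rectify it, while keeping the stratify argument from the question, is to set aside the classes that have a single member before splitting. The sketch below is an assumption-laden illustration, not the only option: the toy data and target arrays are made up, and putting the singleton classes into the training set (rather than dropping them or collecting more samples) is a judgment call you would need to make for your own problem.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): classes 0-2 have 5 samples each, class 3 has only 1.
rng = np.random.default_rng(0)
target = np.array([0] * 5 + [1] * 5 + [2] * 5 + [3])
data = rng.integers(0, 256, size=(len(target), 4))

# Find classes with fewer than 2 members; these are what break stratify.
classes, counts = np.unique(target, return_counts=True)
rare = classes[counts < 2]
is_rare = np.isin(target, rare)

# Stratified split on the classes that have enough members.
X_train, X_test, y_train, y_test = train_test_split(
    data[~is_rare], target[~is_rare],
    test_size=0.2, random_state=42, stratify=target[~is_rare])

# Assumption: singleton classes go to the training set only.
X_train = np.concatenate([X_train, data[is_rare]])
y_train = np.concatenate([y_train, target[is_rare]])

# Every label in the test set now also occurs in the training set.
print(np.setdiff1d(y_test, y_train).size)  # 0
```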
