Reputation: 83
My dataset shape is (91149, 12)
I trained a CNN classifier for a text classification task.
I get a training accuracy of 0.5923
and a testing accuracy of 0.5780.
My target column has 9 labels, as shown below:
df['thematique'].value_counts()
Corporate 42399
Economie collaborative 13272
Innovation 11360
Filiale 5990
Richesses Humaines 4445
Relation sociétaire 4363
Communication 4141
Produits et services 2594
Sites Internet et applis 2585
The model structure:
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

# vocab_size, embedding_matrix and maxlen are defined earlier in my script
model = Sequential()
embedding_layer = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=maxlen, trainable=False)
model.add(embedding_layer)
model.add(Conv1D(128, 7, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(9, activation='sigmoid'))
model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['categorical_accuracy'])
My data for multilabel classification is imbalanced. How can I handle imbalanced data for multilabel classification with a CNN in Keras?
Upvotes: 6
Views: 2587
Reputation: 1557
Accuracy can be a misleading metric for your problem. With such a high class imbalance, I would use the F1 score instead.
As for the loss, you could use the focal loss. It is a variant of categorical cross-entropy that puts more weight on hard examples, which often come from the least represented classes. You can find an example here. In my experience, it helps a lot with small classes in NLP classification tasks.
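To make the point about accuracy concrete, here is a minimal sketch in plain Python. The helper names (`per_class_f1`, `macro_f1`) and the toy labels are made up for illustration; in practice `sklearn.metrics.f1_score` with `average='macro'` computes the same kind of score. The toy "model" always predicts the majority class, which gives a decent accuracy but a poor macro F1.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from true and predicted label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[t] += 1  # t was the true label but missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

# Toy data (hypothetical): 8 majority samples, 2 minority samples,
# and a classifier that always answers "Corporate".
y_true = ["Corporate"] * 8 + ["Filiale"] * 2
y_pred = ["Corporate"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("per-class F1:", per_class_f1(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

The minority class gets an F1 of zero here, and the macro average exposes that even though accuracy looks acceptable.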
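For reference, the idea behind the focal loss can be sketched in a few lines of plain Python. This is only an illustration of the formula for a single sample, not a drop-in Keras loss: the function name and the default values of `gamma` and `alpha` are assumptions. With `gamma=0` and `alpha=1` it reduces to ordinary categorical cross-entropy.

```python
import math

def categorical_focal_loss(y_true, y_prob, gamma=2.0, alpha=1.0, eps=1e-7):
    """Focal loss for one sample.

    y_true: one-hot list, y_prob: predicted class probabilities.
    Each term is the cross-entropy term scaled by (1 - p) ** gamma,
    so confident correct predictions contribute very little.
    """
    loss = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss += -alpha * t * (1.0 - p) ** gamma * math.log(p)
    return loss

one_hot = [0, 1, 0]
easy = [0.05, 0.90, 0.05]  # confident and correct
hard = [0.60, 0.30, 0.10]  # wrong class favoured

easy_ce = categorical_focal_loss(one_hot, easy, gamma=0.0)
easy_focal = categorical_focal_loss(one_hot, easy, gamma=2.0)
hard_ce = categorical_focal_loss(one_hot, hard, gamma=0.0)
hard_focal = categorical_focal_loss(one_hot, hard, gamma=2.0)

print("easy example: CE", easy_ce, "focal", easy_focal)
print("hard example: CE", hard_ce, "focal", hard_focal)
```

The easy example is scaled down far more than the hard one, which is why the loss tends to help classes the model struggles with. Recent Keras releases also include `keras.losses.CategoricalFocalCrossentropy`, if your version has it.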
Upvotes: 2
Reputation: 591
I am not sure that you need to handle the imbalance in Keras specifically, rather than with some intuition about the data. One simple approach is to use the same number of samples for each class. Of course, that causes another problem: you throw away a lot of samples. Still, it is something you can check. Also, when you have imbalanced data, it is not a good idea to compute only the overall classification performance, since it does not show how well each class performs.
You should also compute the confusion matrix to see how well each class performs individually. A more detailed treatment of imbalanced data can be found in this blog and in here.
The most important thing is to use the right tools to evaluate the performance of your classifier, and to handle the input data as proposed in the links I mentioned.
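A rough sketch of the "same amount of data per class" idea (random undersampling), in plain Python. The function name and the toy data are hypothetical. With the counts in the question, the smallest class has 2585 rows, so this would keep 9 × 2585 = 23265 of the 91149 samples, roughly a quarter.

```python
import random
from collections import Counter, defaultdict

def undersample(samples, labels, seed=0):
    """Randomly keep as many samples per class as the smallest class has."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_keep = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_keep):
            balanced.append((sample, label))
    rng.shuffle(balanced)  # avoid blocks of identical labels
    return balanced

# Toy data (hypothetical): 6 / 3 / 2 samples per class.
texts = [f"doc {i}" for i in range(11)]
labels = ["Corporate"] * 6 + ["Innovation"] * 3 + ["Filiale"] * 2

balanced = undersample(texts, labels)
balanced_counts = Counter(label for _, label in balanced)
print(balanced_counts)
```

Apply this to the training split only, so the test set keeps the real class distribution. If discarding that many samples is too costly, another option is the `class_weight` argument of Keras's `model.fit`, which reweights the loss per class instead of dropping data.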
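As a small illustration of what the confusion matrix gives you, here is a plain-Python sketch with made-up labels and predictions (`sklearn.metrics.confusion_matrix` does the same job on real data). Rows are true classes and columns are predicted classes, so the diagonal divided by the row sum is the recall of each class.

```python
def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of samples of true class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_recall(matrix, labels):
    """Fraction of each true class that was predicted correctly."""
    recall = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recall[label] = matrix[i][i] / row_total if row_total else 0.0
    return recall

# Toy data (hypothetical), using three of the class names from the question.
class_names = ["Corporate", "Innovation", "Filiale"]
y_true = ["Corporate"] * 4 + ["Innovation"] * 2 + ["Filiale"]
y_pred = ["Corporate", "Corporate", "Corporate", "Innovation",
          "Corporate", "Innovation", "Corporate"]

cm = confusion_matrix(y_true, y_pred, class_names)
recall = per_class_recall(cm, class_names)
for name, row in zip(class_names, cm):
    print(f"{name:<12} {row}  recall={recall[name]:.2f}")
```

Reading the rows shows which classes get absorbed by the majority class, something a single accuracy number hides.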
Upvotes: 2